用java做抓取的时候免不了要用到多线程的了,因为要同时抓取多个网站或一条线程抓取一个网站的话实在太慢,而且有时一条线程抓取同一个网站的话也比较浪费CPU资源。要用到多线程的等方面,也就免不了对线程的控制或用到线程池。
我在做我们现在的那一个抓取框架的时候,就曾经用过java.util.concurrent.ExecutorService作为线程池,关于ExecutorService的使用代码大概如下:
java.util.concurrent.Executors类的API提供大量创建连接池的静态方法:1.固定大小的线程池:
package BackStage; import java.util.concurrent.Executors; import java.util.concurrent.ExecutorService; public class JavaThreadPool { public static void main(String[] args) { // 创建一个可重用固定线程数的线程池 ExecutorService pool = Executors.newFixedThreadPool(2); // 创建实现了Runnable接口对象,Thread对象当然也实现了Runnable接口 Thread t1 = new MyThread(); Thread t2 = new MyThread(); Thread t3 = new MyThread(); Thread t4 = new MyThread(); Thread t5 = new MyThread(); // 将线程放入池中进行执行 pool.execute(t1); pool.execute(t2); pool.execute(t3); pool.execute(t4); pool.execute(t5); // 关闭线程池 pool.shutdown(); } } class MyThread extends Thread { @Override public void run() { System.out.println(Thread.currentThread().getName() + "正在执行。。。"); } }
后来发现ExecutorService的功能没有想像中的那么好,而且最多只是提供一个线程的容器而然,所以后来我用改用了java.lang.ThreadGroup,ThreadGroup有很多优势,最重要的一点就是它可以对线程进行遍历,知道那些线程已经运行完毕,还有那些线程在运行。关于ThreadGroup的使用代码如下:
class MyThread extends Thread { boolean stopped; MyThread(ThreadGroup tg, String name) { super(tg, name); stopped = false; } public void run() { System.out.println(Thread.currentThread().getName() + " starting."); try { for (int i = 1; i < 1000; i++) { System.out.print("."); Thread.sleep(250); synchronized (this) { if (stopped) break; } } } catch (Exception exc) { System.out.println(Thread.currentThread().getName() + " interrupted."); } System.out.println(Thread.currentThread().getName() + " exiting."); } synchronized void myStop() { stopped = true; } } public class Main { public static void main(String args[]) throws Exception { ThreadGroup tg = new ThreadGroup("My Group"); MyThread thrd = new MyThread(tg, "MyThread #1"); MyThread thrd2 = new MyThread(tg, "MyThread #2"); MyThread thrd3 = new MyThread(tg, "MyThread #3"); thrd.start(); thrd2.start(); thrd3.start(); Thread.sleep(1000); System.out.println(tg.activeCount() + " threads in thread group."); Thread thrds[] = new Thread[tg.activeCount()]; tg.enumerate(thrds); for (Thread t : thrds) System.out.println(t.getName()); thrd.myStop(); Thread.sleep(1000); System.out.println(tg.activeCount() + " threads in tg."); tg.interrupt(); } }
由以上的代码可以看出:ThreadGroup比ExecutorService多以下几个优势
1.ThreadGroup可以遍历线程,知道那些线程已经运行完毕,那些还在运行
2.可以通过ThreadGroup.activeCount知道有多少线程从而可以控制插入的线程数