Using Third-Party Dependency JARs in a Hadoop Job

After we implement a Hadoop MapReduce Job, that Job often depends on a number of external jar files, and when it runs on a Hadoop cluster it sometimes fails with an exception saying a class cannot be found. When this happens, it basically means that during execution the Job could not find the corresponding jar file in its execution context (actually an unjarred directory containing the class files). The natural fix, then, is to configure the classpath correctly so the MapReduce Job can locate the classes at runtime.
There are two good ways to do this. One is to set HADOOP_CLASSPATH, loading the jar files the Job depends on through that variable; this configuration applies only to the Job in question, and the HADOOP_CLASSPATH setting goes away once the Job finishes. The other is to bundle the dependency jars together with the Job code into a single jar at build time; this can make the final jar rather large, but combined with a build tool such as Maven it keeps a one-Job-one-dependency-set build configuration, which is easy to manage. We describe each approach below.

Setting HADOOP_CLASSPATH

For example, suppose we have an application that uses HBase to operate on tables in an HBase database. It certainly also needs ZooKeeper, so the locations of the corresponding jar files must all be set correctly for the running Job to find and load them.
Hadoop provides a helper class, org.apache.hadoop.util.GenericOptionsParser, that takes care of putting the specified files on the classpath for us, which makes this fairly easy.
Below is an example we implemented; the entry-point class of the program is as follows:

package org.shirdrn.kodz.inaction.hbase.job.importing;

import java.io.IOException;
import java.net.URISyntaxException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

/**
 * Table DDL: create 't_sub_domains', 'cf_basic', 'cf_status'
 * <pre>
 * cf_basic:domain cf_basic:len
 * cf_status:status cf_status:live
 * </pre>
 *
 * @author shirdrn
 */
public class DataImporter {

    public static void main(String[] args)
            throws IOException, InterruptedException, ClassNotFoundException, URISyntaxException {

        Configuration conf = HBaseConfiguration.create();
        // GenericOptionsParser consumes generic options such as -libjars
        // and returns only the application-specific arguments.
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();

        if (otherArgs.length < 2) {
            System.err.println("Usage: \n" +
                    "  DataImporter -libjars <jar1>[,<jar2>...[,<jarN>]] <tableName> <input>");
            System.exit(1);
        }
        String tableName = otherArgs[0].trim();
        String input = otherArgs[1].trim();

        // set table columns
        conf.set("table.cf.family", "cf_basic");
        conf.set("table.cf.qualifier.fqdn", "domain");
        conf.set("table.cf.qualifier.timestamp", "create_at");

        Job job = new Job(conf, "Import into HBase table");
        job.setJarByClass(DataImporter.class);
        job.setMapperClass(ImportFileLinesMapper.class);
        job.setOutputFormatClass(TableOutputFormat.class);

        job.getConfiguration().set(TableOutputFormat.OUTPUT_TABLE, tableName);
        job.setOutputKeyClass(ImmutableBytesWritable.class);
        job.setOutputValueClass(Put.class);

        // Map-only job: rows are written straight to HBase by the output format.
        job.setNumReduceTasks(0);

        FileInputFormat.addInputPath(job, new Path(input));

        int exitCode = job.waitForCompletion(true) ? 0 : 1;
        System.exit(exitCode);
    }

}

As the code shows, we can use the -libjars option to specify the third-party jar files the Job depends on at runtime. The concrete steps are as follows:

  • Step 1: Set the environment variables

Edit the .bashrc file and add the following configuration:

export HADOOP_HOME=/opt/stone/cloud/hadoop-1.0.3
export PATH=$PATH:$HADOOP_HOME/bin
export HBASE_HOME=/opt/stone/cloud/hbase-0.94.1
export PATH=$PATH:$HBASE_HOME/bin
export ZK_HOME=/opt/stone/cloud/zookeeper-3.4.3

Don't forget to make the configuration take effect in the current shell (the two commands below are equivalent):

. .bashrc
source .bashrc

With these variables in place, the external jar files can be referenced conveniently.
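A quick way to confirm that the variables are actually visible in the current shell is to print them. This small check is a convenience of ours, not part of the original setup; it flags any variable that is still unset:

```shell
# Print each expected variable, marking any that are missing.
for v in HADOOP_HOME HBASE_HOME ZK_HOME; do
    eval "val=\${$v:-}"
    echo "$v=${val:-<not set>}"
done
```

If a variable shows up as `<not set>`, the .bashrc edit was not sourced into the current shell.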

  • Step 2: Determine the jar files the Job depends on

As mentioned above, using HBase requires the HBase and ZooKeeper jar files. They are put on the launcher's classpath through HADOOP_CLASSPATH, like this:

HADOOP_CLASSPATH=$HBASE_HOME/hbase-0.94.1.jar:$ZK_HOME/zookeeper-3.4.3.jar ./bin/hadoop jar import-into-hbase.jar

A HADOOP_CLASSPATH set this way applies only to the current Job invocation, so there is no need to configure it in .bashrc.
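The per-Job scoping rests on a standard shell rule: an assignment written immediately before a command is exported only into that command's environment, leaving the surrounding shell untouched. A tiny demonstration, independent of Hadoop:

```shell
# The prefix assignment is visible inside the launched command...
inside=$(HADOOP_CLASSPATH=/opt/demo/hbase.jar sh -c 'echo "$HADOOP_CLASSPATH"')
echo "inside:  $inside"
# ...but the invoking shell's environment is left as it was.
echo "outside: ${HADOOP_CLASSPATH:-<unset>}"
```

This is exactly why nothing has to be added to .bashrc for step 2.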

  • Step 3: Run the Job

Run the Job we developed, supplying the HADOOP_CLASSPATH variable on the command line and using the -libjars option to list the third-party jars this Job depends on. The launch command looks like this:

xiaoxiang@ubuntu3:~/hadoop$ HADOOP_CLASSPATH=$HBASE_HOME/hbase-0.94.1.jar:$ZK_HOME/zookeeper-3.4.3.jar ./bin/hadoop jar import-into-hbase.jar org.shirdrn.kodz.inaction.hbase.job.importing.DataImporter -libjars $HBASE_HOME/hbase-0.94.1.jar,$HBASE_HOME/lib/protobuf-java-2.4.0a.jar,$ZK_HOME/zookeeper-3.4.3.jar t_sub_domains /user/xiaoxiang/datasets/domains/

Note that the entries in the HADOOP_CLASSPATH environment variable are separated by colons, while the entries given to -libjars are separated by commas. Also, because GenericOptionsParser handles -libjars, generic options like it must appear before the application's own arguments.
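Since the two lists name the same jars with different separators, it can be convenient to maintain one colon-separated variable and derive the -libjars value from it with tr. This is a hypothetical helper pattern of ours (the variable names JOB_CP and LIBJARS are not from the original command):

```shell
# Colon-separated jar list for HADOOP_CLASSPATH (launcher classpath)...
JOB_CP="$HBASE_HOME/hbase-0.94.1.jar:$ZK_HOME/zookeeper-3.4.3.jar"
# ...converted to the comma-separated form that -libjars expects.
LIBJARS=$(printf '%s' "$JOB_CP" | tr ':' ',')
echo "$LIBJARS"
```

The Job could then be launched as `HADOOP_CLASSPATH="$JOB_CP" ./bin/hadoop jar import-into-hbase.jar <main-class> -libjars "$LIBJARS" <args>`, keeping the two lists in sync by construction.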

With that, the Job we developed runs correctly.
Let's look at the output of a run:

13/04/10 22:03:32 INFO zookeeper.ZooKeeper: Client environment:zookeeper.version=3.4.3-1240972, built on 02/06/2012 10:48 GMT
13/04/10 22:03:32 INFO zookeeper.ZooKeeper: Client environment:host.name=ubuntu3
13/04/10 22:03:32 INFO zookeeper.ZooKeeper: Client environment:java.version=1.6.0_30
13/04/10 22:03:32 INFO zookeeper.ZooKeeper: Client environment:java.vendor=Sun Microsystems Inc.
13/04/10 22:03:32 INFO zookeeper.ZooKeeper: Client environment:java.home=/usr/java/jdk1.6.0_30/jre
13/04/10 22:03:32 INFO zookeeper.ZooKeeper: Client environment:java.class.path=/opt/stone/cloud/hadoop-1.0.3/libexec/../conf:/usr/java/jdk1.6.0_30/lib/tools.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/..:/opt/stone/cloud/hadoop-1.0.3/libexec/../hadoop-core-1.0.3.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/asm-3.2.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/aspectjrt-1.6.5.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/aspectjtools-1.6.5.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/commons-beanutils-1.7.0.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/commons-beanutils-core-1.8.0.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/commons-cli-1.2.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/commons-codec-1.4.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/commons-collections-3.2.1.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/commons-configuration-1.6.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/commons-daemon-1.0.1.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/commons-digester-1.8.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/commons-el-1.0.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/commons-httpclient-3.0.1.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/commons-io-2.1.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/commons-lang-2.4.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/commons-logging-1.1.1.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/commons-logging-api-1.0.4.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/commons-math-2.1.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/commons-net-1.4.1.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/core-3.1.1.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/hadoop-capacity-scheduler-1.0.3.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/hadoop-datajoin-1.0.3.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/hadoop-fairscheduler-1.0.3.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/hadoop-thriftfs-1.0.3.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/hsqldb-1.8.0.10.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/jackson-core-asl-1.8.8.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/jackson-mapper-asl-1.8.8.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/jasper-compiler-5.5.12.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/jasper-runtime-5.5.12.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/jdeb-0.8.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/jersey-core-1.8.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/jersey-json-1.8.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/jersey-server-1.8.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/jets3t-0.6.1.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/jetty-6.1.26.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/jetty-util-6.1.26.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/jsch-0.1.42.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/junit-4.5.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/kfs-0.2.2.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/log4j-1.2.15.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/mockito-all-1.8.5.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/oro-2.0.8.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/protobuf-java-2.4.0a.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/servlet-api-2.5-20081211.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/slf4j-api-1.4.3.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/slf4j-log4j12-1.4.3.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/xmlenc-0.52.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/jsp-2.1/jsp-2.1.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/jsp-2.1/jsp-api-2.1.jar:/opt/stone/cloud/hbase-0.94.1/hbase-0.94.1.jar:/opt/stone/cloud/zookeeper-3.4.3/zookeeper-3.4.3.jar
13/04/10 22:03:32 INFO zookeeper.ZooKeeper: Client environment:java.library.path=/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/native/Linux-amd64-64
13/04/10 22:03:32 INFO zookeeper.ZooKeeper: Client environment:java.io.tmpdir=/tmp
13/04/10 22:03:32 INFO zookeeper.ZooKeeper: Client environment:java.compiler=<NA>
13/04/10 22:03:32 INFO zookeeper.ZooKeeper: Client environment:os.name=Linux
13/04/10 22:03:32 INFO zookeeper.ZooKeeper: Client environment:os.arch=amd64
13/04/10 22:03:32 INFO zookeeper.ZooKeeper: Client environment:os.version=3.0.0-12-server
13/04/10 22:03:32 INFO zookeeper.ZooKeeper: Client environment:user.name=xiaoxiang
13/04/10 22:03:32 INFO zookeeper.ZooKeeper: Client environment:user.home=/home/xiaoxiang
13/04/10 22:03:32 INFO zookeeper.ZooKeeper: Client environment:user.dir=/opt/stone/cloud/hadoop-1.0.3
13/04/10 22:03:32 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=ubuntu3:2222 sessionTimeout=180000 watcher=hconnection
13/04/10 22:03:32 INFO zookeeper.ClientCnxn: Opening socket connection to server /172.0.8.252:2222
13/04/10 22:03:32 INFO zookeeper.RecoverableZooKeeper: The identifier of this process is 17561@ubuntu3
13/04/10 22:03:32 WARN client.ZooKeeperSaslClient: SecurityException: java.lang.SecurityException: Unable to locate a login configuration occurred when trying to find JAAS configuration.
13/04/10 22:03:32 INFO client.ZooKeeperSaslClient: Client will not SASL-authenticate because the default JAAS configuration section 'Client' could not be found. If you are not using SASL, you may ignore this. On the other hand, if you expected SASL to work, please fix your JAAS configuration.
13/04/10 22:03:32 INFO zookeeper.ClientCnxn: Socket connection established to ubuntu3/172.0.8.252:2222, initiating session
13/04/10 22:03:32 INFO zookeeper.ClientCnxn: Session establishment complete on server ubuntu3/172.0.8.252:2222, sessionid = 0x13decd0f3960042, negotiated timeout = 180000
13/04/10 22:03:32 INFO mapreduce.TableOutputFormat: Created table instance for t_sub_domains
13/04/10 22:03:32 INFO input.FileInputFormat: Total input paths to process : 1
13/04/10 22:03:32 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/04/10 22:03:32 WARN snappy.LoadSnappy: Snappy native library not loaded
13/04/10 22:03:32 INFO mapred.JobClient: Running job: job_201303302227_0034
13/04/10 22:03:33 INFO mapred.JobClient: map 0% reduce 0%
13/04/10 22:03:50 INFO mapred.JobClient: map 2% reduce 0%
13/04/10 22:03:53 INFO mapred.JobClient: map 3% reduce 0%
13/04/10 22:03:56 INFO mapred.JobClient: map 4% reduce 0%
13/04/10 22:03:59 INFO mapred.JobClient: map 6% reduce 0%
13/04/10 22:04:03 INFO mapred.JobClient: map 7% reduce 0%
13/04/10 22:04:06 INFO mapred.JobClient: map 8% reduce 0%
13/04/10 22:04:09 INFO mapred.JobClient: map 10% reduce 0%
13/04/10 22:04:15 INFO mapred.JobClient: map 12% reduce 0%
13/04/10 22:04:18 INFO mapred.JobClient: map 13% reduce 0%
13/04/10 22:04:21 INFO mapred.JobClient: map 14% reduce 0%
13/04/10 22:04:24 INFO mapred.JobClient: map 15% reduce 0%
13/04/10 22:04:27 INFO mapred.JobClient: map 17% reduce 0%
13/04/10 22:04:33 INFO mapred.JobClient: map 18% reduce 0%
13/04/10 22:04:36 INFO mapred.JobClient: map 19% reduce 0%
13/04/10 22:04:39 INFO mapred.JobClient: map 20% reduce 0%
13/04/10 22:04:42 INFO mapred.JobClient: map 21% reduce 0%
13/04/10 22:04:45 INFO mapred.JobClient: map 23% reduce 0%
13/04/10 22:04:48 INFO mapred.JobClient: map 24% reduce 0%
13/04/10 22:04:51 INFO mapred.JobClient: map 25% reduce 0%
13/04/10 22:04:54 INFO mapred.JobClient: map 27% reduce 0%
13/04/10 22:04:57 INFO mapred.JobClient: map 28% reduce 0%
13/04/10 22:05:00 INFO mapred.JobClient: map 29% reduce 0%
13/04/10 22:05:03 INFO mapred.JobClient: map 31% reduce 0%
13/04/10 22:05:06 INFO mapred.JobClient: map 32% reduce 0%
13/04/10 22:05:09 INFO mapred.JobClient: map 33% reduce 0%
13/04/10 22:05:12 INFO mapred.JobClient: map 34% reduce 0%
13/04/10 22:05:15 INFO mapred.JobClient: map 35% reduce 0%
13/04/10 22:05:18 INFO mapred.JobClient: map 37% reduce 0%
13/04/10 22:05:21 INFO mapred.JobClient: map 38% reduce 0%
13/04/10 22:05:24 INFO mapred.JobClient: map 39% reduce 0%
13/04/10 22:05:27 INFO mapred.JobClient: map 41% reduce 0%
13/04/10 22:05:30 INFO mapred.JobClient: map 42% reduce 0%
13/04/10 22:05:33 INFO mapred.JobClient: map 43% reduce 0%
13/04/10 22:05:36 INFO mapred.JobClient: map 44% reduce 0%
13/04/10 22:05:39 INFO mapred.JobClient: map 46% reduce 0%
13/04/10 22:05:42 INFO mapred.JobClient: map 47% reduce 0%
13/04/10 22:05:45 INFO mapred.JobClient: map 48% reduce 0%
13/04/10 22:05:48 INFO mapred.JobClient: map 50% reduce 0%
13/04/10 22:05:54 INFO mapred.JobClient: map 52% reduce 0%
13/04/10 22:05:57 INFO mapred.JobClient: map 53% reduce 0%
13/04/10 22:06:00 INFO mapred.JobClient: map 54% reduce 0%
13/04/10 22:06:03 INFO mapred.JobClient: map 55% reduce 0%
13/04/10 22:06:06 INFO mapred.JobClient: map 57% reduce 0%
13/04/10 22:06:12 INFO mapred.JobClient: map 59% reduce 0%
13/04/10 22:06:15 INFO mapred.JobClient: map 60% reduce 0%
13/04/10 22:06:18 INFO mapred.JobClient: map 61% reduce 0%
13/04/10 22:06:21 INFO mapred.JobClient: map 62% reduce 0%
13/04/10 22:06:24 INFO mapred.JobClient: map 63% reduce 0%
13/04/10 22:06:27 INFO mapred.JobClient: map 64% reduce 0%
13/04/10 22:06:30 INFO mapred.JobClient: map 66% reduce 0%
13/04/10 22:06:33 INFO mapred.JobClient: map 67% reduce 0%
13/04/10 22:06:36 INFO mapred.JobClient: map 68% reduce 0%
13/04/10 22:06:42 INFO mapred.JobClient: map 69% reduce 0%
13/04/10 22:06:45 INFO mapred.JobClient: map 70% reduce 0%
13/04/10 22:06:48 INFO mapred.JobClient: map 71% reduce 0%
13/04/10 22:06:51 INFO mapred.JobClient: map 73% reduce 0%
13/04/10 22:06:54 INFO mapred.JobClient: map 74% reduce 0%
13/04/10 22:06:57 INFO mapred.JobClient: map 75% reduce 0%
13/04/10 22:07:00 INFO mapred.JobClient: map 77% reduce 0%
13/04/10 22:07:03 INFO mapred.JobClient: map 78% reduce 0%
13/04/10 22:07:12 INFO mapred.JobClient: map 79% reduce 0%
13/04/10 22:07:18 INFO mapred.JobClient: map 80% reduce 0%
13/04/10 22:07:24 INFO mapred.JobClient: map 81% reduce 0%
13/04/10 22:07:30 INFO mapred.JobClient: map 82% reduce 0%
13/04/10 22:07:36 INFO mapred.JobClient: map 83% reduce 0%
13/04/10 22:07:48 INFO mapred.JobClient: map 84% reduce 0%
13/04/10 22:07:51 INFO mapred.JobClient: map 85% reduce 0%
13/04/10 22:07:59 INFO mapred.JobClient: map 86% reduce 0%
13/04/10 22:08:05 INFO mapred.JobClient: map 87% reduce 0%
13/04/10 22:08:11 INFO mapred.JobClient: map 88% reduce 0%
13/04/10 22:08:17 INFO mapred.JobClient: map 89% reduce 0%
13/04/10 22:08:23 INFO mapred.JobClient: map 90% reduce 0%
13/04/10 22:08:29 INFO mapred.JobClient: map 91% reduce 0%
13/04/10 22:08:35 INFO mapred.JobClient: map 92% reduce 0%
13/04/10 22:08:41 INFO mapred.JobClient: map 93% reduce 0%
13/04/10 22:08:47 INFO mapred.JobClient: map 94% reduce 0%
13/04/10 22:08:53 INFO mapred.JobClient: map 95% reduce 0%
13/04/10 22:08:59 INFO mapred.JobClient: map 96% reduce 0%
13/04/10 22:09:05 INFO mapred.JobClient: map 97% reduce 0%
13/04/10 22:09:11 INFO mapred.JobClient: map 98% reduce 0%
13/04/10 22:09:17 INFO mapred.JobClient: map 99% reduce 0%
13/04/10 22:09:23 INFO mapred.JobClient: map 100% reduce 0%
13/04/10 22:09:31 INFO mapred.JobClient: Job complete: job_201303302227_0034
13/04/10 22:09:31 INFO mapred.JobClient: Counters: 18
13/04/10 22:09:31 INFO mapred.JobClient:   Job Counters
13/04/10 22:09:31 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=550605
13/04/10 22:09:31 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
13/04/10 22:09:31 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
13/04/10 22:09:31 INFO mapred.JobClient:     Launched map tasks=2
13/04/10 22:09:31 INFO mapred.JobClient:     Data-local map tasks=2
13/04/10 22:09:31 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
13/04/10 22:09:31 INFO mapred.JobClient:   File Output Format Counters
13/04/10 22:09:31 INFO mapred.JobClient:     Bytes Written=0
13/04/10 22:09:31 INFO mapred.JobClient:   FileSystemCounters
13/04/10 22:09:31 INFO mapred.JobClient:     HDFS_BYTES_READ=104394990
13/04/10 22:09:31 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=64078
13/04/10 22:09:31 INFO mapred.JobClient:   File Input Format Counters
13/04/10 22:09:31 INFO mapred.JobClient:     Bytes Read=104394710
13/04/10 22:09:31 INFO mapred.JobClient:   Map-Reduce Framework
13/04/10 22:09:31 INFO mapred.JobClient:     Map input records=4995670
13/04/10 22:09:31 INFO mapred.JobClient:     Physical memory (bytes) snapshot=279134208
13/04/10 22:09:31 INFO mapred.JobClient:     Spilled Records=0
13/04/10 22:09:31 INFO mapred.JobClient:     CPU time spent (ms)=129130
13/04/10 22:09:31 INFO mapred.JobClient:     Total committed heap usage (bytes)=202833920
13/04/10 22:09:31 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=1170251776
13/04/10 22:09:31 INFO mapred.JobClient:     Map output records=4995670
13/04/10 22:09:31 INFO mapred.JobClient:     SPLIT_RAW_BYTES=280

As the java.class.path entry in the log shows, besides the jars under the path given by HADOOP_HOME (including its lib directories), the launcher also loaded the jars we appended via HADOOP_CLASSPATH; in addition, the third-party jars named in the -libjars option are shipped to the cluster so they are available to the Job's tasks at runtime.

Packaging the Job Code Together with Its Dependency JARs

I prefer this approach. First, it takes advantage of Maven's strengths, such as dependency management and automated builds. Second, other developers or operators who want to use the Job need not worry about extra configuration: building according to the ordinary Maven rules produces the final deployable artifact, which avoids the common problems that otherwise surface at launch time (such as a misconfigured CLASSPATH).
With the following Maven build-plugin configuration, running the mvn package command accomplishes all of this:

<build>
  <plugins>
    <plugin>
      <artifactId>maven-assembly-plugin</artifactId>
      <configuration>
        <archive>
          <manifest>
            <mainClass>org.shirdrn.solr.cloud.index.hadoop.SolrCloudIndexer</mainClass>
          </manifest>
        </archive>
        <descriptorRefs>
          <descriptorRef>jar-with-dependencies</descriptorRef>
        </descriptorRefs>
      </configuration>
      <executions>
        <execution>
          <id>make-assembly</id>
          <phase>package</phase>
          <goals>
            <goal>single</goal>
          </goals>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>

The resulting jar file is placed under the target directory, with a name like solr-platform-2.0-jar-with-dependencies.jar. It can then be copied straight to the desired location and submitted to the Hadoop cluster to run.
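Since the assembled jar's name embeds the version, a deployment script may prefer to locate it by glob rather than hard-code the name. A minimal sketch of ours (the paths are illustrative, and the actual hadoop submission line is left commented out since it depends on a running cluster):

```shell
# Pick the newest assembly jar out of target/, if one exists.
FATJAR=$(ls -t target/*-jar-with-dependencies.jar 2>/dev/null | head -n 1)
if [ -n "$FATJAR" ]; then
    echo "submitting $FATJAR"
    # hadoop jar "$FATJAR" <job arguments...>
else
    echo "no assembly jar found; run 'mvn package' first" >&2
fi
```

Because all dependencies are inside the jar, no HADOOP_CLASSPATH or -libjars arguments are needed for this style of submission.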
