Posted to dev@nutch.apache.org by Tom Davidson <td...@covario.com> on 2011/08/01 20:59:13 UTC

Nutch 2 and Cassandra

Hi All,

I am kind of at my wit's end here, so I am hoping someone here can help. I am trying to use Nutch 2 with Cassandra, and I have been successful using the runtime/local build. I am using Cloudera CDH3 on CentOS 5 and I do not want to contaminate my Hadoop install by dropping in a bunch of Nutch jars, etc., so I am trying to use the nutch-2-dev.job jar instead. When I do, I get the error below. I have double and triple checked the classpath and the included jars, and the only jar that contains FieldValueMetaData is libthrift-0.6.1.jar, which does have the method that is claimed to be missing. Any ideas?

Thanks,
Tom




[tdavidson@nadevsan06 ~]$ bin/nutch inject urls
/opt/jdk1.6.0_21/bin/java -Dproc_jar -Xmx1000m -Dhadoop.log.dir=/usr/lib/hadoop-0.20/logs -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/usr/lib/hadoop-0.20 -Dhadoop.id.str=tdavidson -Dhadoop.root.logger=INFO,console -Djava.library.path=/usr/lib/hadoop-0.20/lib/native/Linux-amd64-64 -Dhadoop.policy.file=hadoop-policy.xml -classpath /usr/lib/hadoop-0.20/conf:/opt/jdk1.6.0_21/lib/tools.jar:/usr/lib/hadoop-0.20:/usr/lib/hadoop-0.20/hadoop-core-0.20.2-cdh3u1.jar:/usr/lib/hadoop-0.20/lib/ant-contrib-1.0b3.jar:/usr/lib/hadoop-0.20/lib/aspectjrt-1.6.5.jar:/usr/lib/hadoop-0.20/lib/aspectjtools-1.6.5.jar:/usr/lib/hadoop-0.20/lib/commons-cli-1.2.jar:/usr/lib/hadoop-0.20/lib/commons-codec-1.4.jar:/usr/lib/hadoop-0.20/lib/commons-daemon-1.0.1.jar:/usr/lib/hadoop-0.20/lib/commons-el-1.0.jar:/usr/lib/hadoop-0.20/lib/commons-httpclient-3.0.1.jar:/usr/lib/hadoop-0.20/lib/commons-logging-1.0.4.jar:/usr/lib/hadoop-0.20/lib/commons-logging-api-1.0.4.jar:/usr/lib/hadoop-0.20/lib/commons-net-1.4.1.jar:/usr/lib/hadoop-0.20/lib/core-3.1.1.jar:/usr/lib/hadoop-0.20/lib/hadoop-fairscheduler-0.20.2-cdh3u1.jar:/usr/lib/hadoop-0.20/lib/hsqldb-1.8.0.10.jar:/usr/lib/hadoop-0.20/lib/hue-plugins-1.2.0-cdh3u1.jar:/usr/lib/hadoop-0.20/lib/jackson-core-asl-1.5.2.jar:/usr/lib/hadoop-0.20/lib/jackson-mapper-asl-1.5.2.jar:/usr/lib/hadoop-0.20/lib/jasper-compiler-5.5.12.jar:/usr/lib/hadoop-0.20/lib/jasper-runtime-5.5.12.jar:/usr/lib/hadoop-0.20/lib/jets3t-0.6.1.jar:/usr/lib/hadoop-0.20/lib/jetty-6.1.26.jar:/usr/lib/hadoop-0.20/lib/jetty-servlet-tester-6.1.26.jar:/usr/lib/hadoop-0.20/lib/jetty-util-6.1.26.jar:/usr/lib/hadoop-0.20/lib/jsch-0.1.42.jar:/usr/lib/hadoop-0.20/lib/junit-4.5.jar:/usr/lib/hadoop-0.20/lib/kfs-0.2.2.jar:/usr/lib/hadoop-0.20/lib/log4j-1.2.15.jar:/usr/lib/hadoop-0.20/lib/mockito-all-1.8.2.jar:/usr/lib/hadoop-0.20/lib/oro-2.0.8.jar:/usr/lib/hadoop-0.20/lib/servlet-api-2.5-20081211.jar:/usr/lib/hadoop-0.20/lib/servlet-api-2.5-6.1.14.jar:/usr/lib/hadoop-0.20/lib/slf4j-api-1.4.3.jar:/usr/lib/hadoop-0.20/lib/slf4j-log4j12-1.4.3.jar:/usr/lib/hadoop-0.20/lib/xmlenc-0.52.jar:/usr/lib/hadoop-0.20/lib/jsp-2.1/jsp-2.1.jar:/usr/lib/hadoop-0.20/lib/jsp-2.1/jsp-api-2.1.jar org.apache.hadoop.util.RunJar /home/SEMDIRECTOR/tdavidson/nutch-2.job org.apache.nutch.crawl.InjectorJob urls
11/08/01 11:51:54 INFO crawl.InjectorJob: InjectorJob: starting
11/08/01 11:51:54 INFO crawl.InjectorJob: InjectorJob: urlDir: urls
11/08/01 11:51:55 INFO connection.CassandraHostRetryService: Downed Host Retry service started with queue size -1 and retry delay 10s
11/08/01 11:51:55 INFO service.JmxMonitor: Registering JMX me.prettyprint.cassandra.service_Test Cluster:ServiceType=hector,MonitorType=hector
11/08/01 11:51:55 ERROR crawl.InjectorJob: InjectorJob: org.apache.gora.util.GoraException: java.lang.reflect.InvocationTargetException
        at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:110)
        at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:93)
        at org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:59)
        at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:243)
        at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:268)
        at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:282)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:292)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
Caused by: java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
        at org.apache.gora.util.ReflectionUtils.newInstance(ReflectionUtils.java:76)
        at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:102)
        ... 12 more
Caused by: java.lang.NoSuchMethodError: org.apache.thrift.meta_data.FieldValueMetaData.<init>(BZ)V
        at org.apache.cassandra.thrift.CfDef.<clinit>(CfDef.java:299)
        at org.apache.cassandra.thrift.KsDef.read(KsDef.java:753)
        at org.apache.cassandra.thrift.Cassandra$describe_keyspace_result.read(Cassandra.java:24338)
        at org.apache.cassandra.thrift.Cassandra$Client.recv_describe_keyspace(Cassandra.java:1371)
        at org.apache.cassandra.thrift.Cassandra$Client.describe_keyspace(Cassandra.java:1346)
        at me.prettyprint.cassandra.service.AbstractCluster$4.execute(AbstractCluster.java:192)
        at me.prettyprint.cassandra.service.AbstractCluster$4.execute(AbstractCluster.java:187)
        at me.prettyprint.cassandra.service.Operation.executeAndSetResult(Operation.java:101)
        at me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:232)
        at me.prettyprint.cassandra.service.AbstractCluster.describeKeyspace(AbstractCluster.java:201)
        at org.apache.gora.cassandra.store.CassandraClient.checkKeyspace(CassandraClient.java:82)
        at org.apache.gora.cassandra.store.CassandraClient.init(CassandraClient.java:69)
        at org.apache.gora.cassandra.store.CassandraStore.<init>(CassandraStore.java:68)
        ... 18 more
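
A note for anyone reading the trace above: in JVM descriptor notation, <init> is a constructor, B is byte, Z is boolean and V is a void return, so the runtime could not find a FieldValueMetaData(byte, boolean) constructor in whichever copy of the class it loaded. The following is only a minimal sketch, not something from this thread, for checking where a class was actually loaded from and whether it has that constructor. The class name ThriftClasspathCheck and its helper methods are made up for illustration; it would need to run with the same classpath as the failing job to tell you anything useful.

```java
import java.security.CodeSource;

public class ThriftClasspathCheck {

    /** Returns the jar or directory a class was loaded from, if the JVM exposes it. */
    public static String locationOf(Class<?> cls) {
        CodeSource src = cls.getProtectionDomain().getCodeSource();
        if (src == null || src.getLocation() == null) {
            return "<bootstrap or unknown>";
        }
        return src.getLocation().toString();
    }

    /** True if the class declares a constructor with exactly these parameter types. */
    public static boolean hasConstructor(Class<?> cls, Class<?>... paramTypes) {
        try {
            cls.getDeclaredConstructor(paramTypes);
            return true;
        } catch (NoSuchMethodException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Class name taken from the stack trace; pass another name as the first argument.
        String name = args.length > 0 ? args[0]
                : "org.apache.thrift.meta_data.FieldValueMetaData";
        Class<?> cls;
        try {
            cls = Class.forName(name);
        } catch (ClassNotFoundException e) {
            System.out.println("Not on the classpath: " + name);
            return;
        }
        System.out.println(name + " loaded from " + locationOf(cls));
        // (BZ)V in the error message means a constructor taking (byte, boolean).
        System.out.println("has <init>(byte, boolean): "
                + hasConstructor(cls, byte.class, boolean.class));
    }
}
```

If the reported location is not the libthrift jar you expected, that other jar is the one shadowing it.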

RE: Nutch 2 and Cassandra

Posted by Tom Davidson <td...@covario.com>.
I don't know how to edit the wiki directly (I think I need an account). But I do have additional information on the NoSuchMethodError when using CDH3.  If you install only the CDH3 distro, you are OK. It is when you add the Hue distros or try to use a Hadoop installed with the Cloudera SCM products that you run into problems. I had to remove hue-plugins-1.2.0-cdh3u1.jar from my hadoop lib folder (/usr/lib/hadoop-0.20/lib).

Setting the HADOOP_OPTS in hadoop-env.sh works.

From: lewis john mcgibbney [mailto:lewis.mcgibbney@gmail.com]
Sent: Wednesday, August 03, 2011 3:03 PM
To: dev@nutch.apache.org
Subject: Re: Nutch 2 and Cassandra

Hi Tom,

OK I've added the errors you discussed above in the following parts of the wiki respectively [1] & [2].

It would be great if you could have a look over them and edit them as you see fit to correct any misinterpretations. In the latter section I have intentionally not added in your last comments as I was trying to think if it's possible to add this variable to hadoop-env.sh or something similar. Can you comment/advise/edit?

Thank you

[1] http://wiki.apache.org/nutch/ErrorMessagesInNutch2#Nutch_2.0_and_Apache_Cassandra
[2] http://wiki.apache.org/nutch/ErrorMessagesInNutch2#Missing_plugins_whilst_running_Nutch_2.0_on_Cloudera.27s_CDH3
On Tue, Aug 2, 2011 at 10:36 PM, Tom Davidson <td...@covario.com> wrote:
I did run into a couple more problems running Nutch 2 with CDH3. See https://issues.apache.org/jira/browse/NUTCH-937. I added a comment on the thread explaining my additional problem. I worked around the problem by unjarring the nutch-2-dev.job and setting the HADOOP_CLASSPATH (see below) environment variable. Not an ideal solution, but it works.

In order to run Nutch 2 on CDH3 I added the following to nutch-site.xml and rebuilt the nutch-2-dev.job:

    <property>
        <name>mapreduce.job.jar.unpack.pattern</name>
        <value>(?:classes/|lib/|plugins/).*</value>
    </property>

    <property>
        <name>plugin.folders</name>
        <value>${job.local.dir}/../jars/plugins</value>
    </property>
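
To illustrate what the unpack pattern above selects, here is a small sketch, not part of the original message. It assumes the pattern is applied as a full match against each entry name inside the job file, which I believe is how Hadoop's RunJar uses it, but treat that as an assumption. The entry names are hypothetical examples and the class name UnpackPatternDemo is made up.

```java
import java.util.regex.Pattern;

public class UnpackPatternDemo {

    // The value of mapreduce.job.jar.unpack.pattern from the property above.
    static final Pattern UNPACK = Pattern.compile("(?:classes/|lib/|plugins/).*");

    /** True if an archive entry with this name matches the whole pattern. */
    public static boolean wouldUnpack(String entryName) {
        return UNPACK.matcher(entryName).matches();
    }

    public static void main(String[] args) {
        // Hypothetical entry names, chosen only to show both outcomes.
        String[] entries = {
            "lib/libthrift-0.6.1.jar",
            "plugins/parse-html/plugin.xml",
            "classes/nutch-default.xml",
            "META-INF/MANIFEST.MF",
            "nutch-site.xml"
        };
        for (String entry : entries) {
            System.out.println(entry + " -> " + wouldUnpack(entry));
        }
    }
}
```

The point of adding plugins/ to the pattern is that entries under that directory are matched as well, alongside classes/ and lib/.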

And I had to set this environment variable to my expanded plugins folder:

export HADOOP_OPTS="-Djob.local.dir=/<MY HOME>/nutch/plugins"
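
For readers wondering how the -Djob.local.dir setting feeds into plugin.folders: as far as I know, Hadoop's Configuration expands ${...} references in property values and consults Java system properties before other configuration properties. The sketch below is a simplified imitation of that behaviour, not Hadoop's code; it does a single pass only, and the class name VarExpansionSketch and the example path /home/example are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class VarExpansionSketch {

    // Matches ${name} where name contains no '}', '$' or whitespace.
    private static final Pattern VAR = Pattern.compile("\\$\\{([^}$\\s]+)\\}");

    /**
     * Expands ${name} references once, preferring sysProps over confProps.
     * References that cannot be resolved are left untouched.
     */
    public static String expand(String value,
                                Map<String, String> sysProps,
                                Map<String, String> confProps) {
        Matcher m = VAR.matcher(value);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String name = m.group(1);
            String replacement = sysProps.containsKey(name)
                    ? sysProps.get(name)
                    : confProps.get(name);
            if (replacement == null) {
                replacement = m.group(0); // keep the original ${name} text
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Stands in for -Djob.local.dir=... passed through HADOOP_OPTS.
        Map<String, String> sysProps = new HashMap<String, String>();
        sysProps.put("job.local.dir", "/home/example/nutch/plugins");
        Map<String, String> confProps = new HashMap<String, String>();

        String pluginFolders = "${job.local.dir}/../jars/plugins";
        System.out.println(expand(pluginFolders, sysProps, confProps));
    }
}
```

Whatever value job.local.dir has, the resulting plugin.folders path still contains the "/../jars/plugins" suffix, so the directory that ends up being searched is a sibling of the one named in HADOOP_OPTS.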





From: lewis john mcgibbney [mailto:lewis.mcgibbney@gmail.com]
Sent: Tuesday, August 02, 2011 2:00 PM
To: dev@nutch.apache.org
Subject: Re: Nutch 2 and Cassandra

Hi

I've been watching progress on this thread with interest and think that this would be a great addition to the wiki under the following page [1]

I am happy to write it up. However, is there anything else we need to be aware of in addition to the material you have provided, for example some latent info that has been assumed or not explained?

Thank you

[1] http://wiki.apache.org/nutch/ErrorMessagesInNutch2
On Tue, Aug 2, 2011 at 6:32 PM, Tom Davidson <td...@covario.com> wrote:
I found the problem. I am using Cloudera CDH3 and it has a hue plugins jar with an older thrift library in it. I removed the jar from my classpath and all is good. Thanks for your help.
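
Since the cause turned out to be a second copy of the thrift classes on the classpath, here is a rough sketch, not from the thread, of one way to look for such duplicates: walk a classpath string and report every jar or directory that contains a given class. The class name FindClassInJars is made up. It only inspects top-level entries, so classes packed in a jar nested inside another jar would not be found.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.jar.JarFile;

public class FindClassInJars {

    /** Converts a.b.C into the archive entry name a/b/C.class. */
    public static String toEntryName(String className) {
        return className.replace('.', '/') + ".class";
    }

    /** Returns every classpath element (jar or directory) that contains the class. */
    public static List<String> jarsContaining(String classpath, String className) {
        String entry = toEntryName(className);
        List<String> hits = new ArrayList<String>();
        for (String element : classpath.split(File.pathSeparator)) {
            File f = new File(element);
            if (f.isDirectory()) {
                if (new File(f, entry).isFile()) {
                    hits.add(element);
                }
            } else if (f.isFile() && element.endsWith(".jar")) {
                try (JarFile jar = new JarFile(f)) {
                    if (jar.getEntry(entry) != null) {
                        hits.add(element);
                    }
                } catch (IOException e) {
                    // Unreadable or corrupt archive: skip it in this sketch.
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Defaults: this JVM's classpath and the class named in the stack trace.
        String classpath = args.length > 0 ? args[0] : System.getProperty("java.class.path");
        String className = args.length > 1 ? args[1]
                : "org.apache.thrift.meta_data.FieldValueMetaData";
        for (String hit : jarsContaining(classpath, className)) {
            System.out.println(hit);
        }
    }
}
```

More than one line of output means the class exists in several places, and the JVM will generally use the first one it finds on the classpath.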

-----Original Message-----
From: Tom Davidson [mailto:tdavidson@covario.com]
Sent: Monday, August 01, 2011 3:29 PM
To: dev@nutch.apache.org
Subject: RE: Nutch 2 and Cassandra

OK... Are you running with a clustered version of Hadoop? I think you have to have your HADOOP_HOME env variable set. Otherwise it runs in local mode. I have been able to run in local mode, but not in deployed mode.


-----Original Message-----
From: Alexis [mailto:alexis.detreglode@gmail.com]
Sent: Monday, August 01, 2011 3:25 PM
To: dev@nutch.apache.org
Subject: Re: Nutch 2 and Cassandra

Ok this version of hector was properly resolved. Thanks!

These are the logs:
~/java/workspace/Nutch/trunk/runtime/deploy$ bin/nutch inject ~/java/workspace/Nutch/seeds
11/08/01 15:17:45 INFO crawl.InjectorJob: InjectorJob: starting
11/08/01 15:17:45 INFO crawl.InjectorJob: InjectorJob: urlDir: /home/alex/java/workspace/Nutch/seeds
11/08/01 15:17:45 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
11/08/01 15:17:46 INFO connection.CassandraHostRetryService: Downed Host Retry service started with queue size -1 and retry delay 10s
11/08/01 15:17:46 INFO service.JmxMonitor: Registering JMX me.prettyprint.cassandra.service_Test Cluster:ServiceType=hector,MonitorType=hector
11/08/01 15:17:47 INFO store.CassandraClient: Keyspace 'webpage' in cluster 'Test Cluster' was created on host 'localhost'
11/08/01 15:17:48 INFO input.FileInputFormat: Total input paths to process : 1
11/08/01 15:17:49 INFO mapred.JobClient: Running job: job_local_0001
11/08/01 15:17:49 INFO input.FileInputFormat: Total input paths to process : 1
11/08/01 15:17:49 INFO mapreduce.GoraRecordWriter: gora.buffer.write.limit = 10000
11/08/01 15:17:49 INFO plugin.PluginRepository: Plugins: looking in: /tmp/hadoop-alex/hadoop-unjar8045717865743865180/plugins
11/08/01 15:17:49 INFO plugin.PluginRepository: Plugin Auto-activation mode: [true]
11/08/01 15:17:49 INFO plugin.PluginRepository: Registered Plugins:
11/08/01 15:17:49 INFO plugin.PluginRepository:         the nutch core extension points (nutch-extensionpoints)
11/08/01 15:17:49 INFO plugin.PluginRepository:         Basic URL Normalizer (urlnormalizer-basic)
11/08/01 15:17:49 INFO plugin.PluginRepository:         Basic Indexing Filter (index-basic)
11/08/01 15:17:49 INFO plugin.PluginRepository:         Html Parse Plug-in (parse-html)
11/08/01 15:17:49 INFO plugin.PluginRepository:         HTTP Framework (lib-http)
11/08/01 15:17:49 INFO plugin.PluginRepository:         Pass-through URL Normalizer (urlnormalizer-pass)
11/08/01 15:17:49 INFO plugin.PluginRepository:         Regex URL Filter (urlfilter-regex)
11/08/01 15:17:49 INFO plugin.PluginRepository:         Http Protocol Plug-in (protocol-http)
11/08/01 15:17:49 INFO plugin.PluginRepository:         Regex URL Normalizer (urlnormalizer-regex)
11/08/01 15:17:49 INFO plugin.PluginRepository:         Tika Parser Plug-in (parse-tika)
11/08/01 15:17:49 INFO plugin.PluginRepository:         OPIC Scoring Plug-in (scoring-opic)
11/08/01 15:17:49 INFO plugin.PluginRepository:         CyberNeko HTML Parser (lib-nekohtml)
11/08/01 15:17:49 INFO plugin.PluginRepository:         Anchor Indexing Filter (index-anchor)
11/08/01 15:17:49 INFO plugin.PluginRepository:         Regex URL Filter Framework (lib-regex-filter)
11/08/01 15:17:49 INFO plugin.PluginRepository: Registered Extension-Points:
11/08/01 15:17:49 INFO plugin.PluginRepository:         Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
11/08/01 15:17:49 INFO plugin.PluginRepository:         Nutch Protocol (org.apache.nutch.protocol.Protocol)
11/08/01 15:17:49 INFO plugin.PluginRepository:         Parse Filter (org.apache.nutch.parse.ParseFilter)
11/08/01 15:17:49 INFO plugin.PluginRepository:         Nutch URL Filter (org.apache.nutch.net.URLFilter)
11/08/01 15:17:49 INFO plugin.PluginRepository:         Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
11/08/01 15:17:49 INFO plugin.PluginRepository:         Nutch Content Parser (org.apache.nutch.parse.Parser)
11/08/01 15:17:49 INFO plugin.PluginRepository:         Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
11/08/01 15:17:50 INFO conf.Configuration: found resource regex-normalize.xml at file:/tmp/hadoop-alex/hadoop-unjar8045717865743865180/regex-normalize.xml
11/08/01 15:17:50 INFO conf.Configuration: found resource regex-urlfilter.txt at file:/tmp/hadoop-alex/hadoop-unjar8045717865743865180/regex-urlfilter.txt
11/08/01 15:17:50 INFO regex.RegexURLNormalizer: can't find rules for scope 'inject', using default
11/08/01 15:17:50 INFO mapred.JobClient:  map 0% reduce 0%
11/08/01 15:17:51 INFO mapred.TaskRunner: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
11/08/01 15:17:51 INFO mapred.LocalJobRunner:
11/08/01 15:17:51 INFO mapred.TaskRunner: Task 'attempt_local_0001_m_000000_0' done.
11/08/01 15:17:52 INFO mapred.JobClient:  map 100% reduce 0%
11/08/01 15:17:52 INFO mapred.JobClient: Job complete: job_local_0001
11/08/01 15:17:52 INFO mapred.JobClient: Counters: 5
11/08/01 15:17:52 INFO mapred.JobClient:   FileSystemCounters
11/08/01 15:17:52 INFO mapred.JobClient:     FILE_BYTES_READ=44872735
11/08/01 15:17:52 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=45245279
11/08/01 15:17:52 INFO mapred.JobClient:   Map-Reduce Framework
11/08/01 15:17:52 INFO mapred.JobClient:     Map input records=3
11/08/01 15:17:52 INFO mapred.JobClient:     Spilled Records=0
11/08/01 15:17:52 INFO mapred.JobClient:     Map output records=3
11/08/01 15:17:52 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
11/08/01 15:17:52 INFO crawl.InjectorJob: InjectorJob: finished



This is what was added to ivy/ivy.xml:

+       <dependency org="org.apache.gora" name="gora-cassandra" rev="0.2-incubating" conf="*->compile"/>
+       <dependency org="org.apache.cassandra" name="cassandra-thrift" rev="0.8.1"/>
+       <dependency org="com.ecyrd.speed4j" name="speed4j" rev="0.9" conf="*->*,!javadoc,!sources"/>
+       <dependency org="com.github.stephenc.high-scale-lib" name="high-scale-lib" rev="1.1.2" conf="*->*,!javadoc,!sources"/>
+       <dependency org="com.google.collections" name="google-collections" rev="1.0" conf="*->*,!javadoc,!sources"/>
+       <dependency org="com.google.guava" name="guava" rev="r09" conf="*->*,!javadoc,!sources"/>
+       <dependency org="org.apache.cassandra" name="apache-cassandra" rev="0.8.1"/>
+       <dependency org="me.prettyprint" name="hector-core" rev="0.8.0-2"/>
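
After rebuilding with dependencies like the ones above, one way to confirm that Ivy actually bundled them is to list the jars packed under lib/ in the resulting job file. This is only a sketch and not part of the original message: it assumes the job file is an ordinary zip archive with its dependency jars under lib/, and the class name JobFileLibs is made up.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class JobFileLibs {

    /** Returns the sorted names of all lib/*.jar entries in the given archive. */
    public static List<String> libEntries(File archive) throws IOException {
        List<String> names = new ArrayList<String>();
        try (ZipFile zip = new ZipFile(archive)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                String name = entry.getName();
                if (!entry.isDirectory() && name.startsWith("lib/") && name.endsWith(".jar")) {
                    names.add(name);
                }
            }
        }
        Collections.sort(names);
        return names;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java JobFileLibs <path to .job file>");
            return;
        }
        for (String name : libEntries(new File(args[0]))) {
            System.out.println(name);
        }
    }
}
```

In the output you would look for the thrift, cassandra and hector jars; if any are missing, the corresponding Ivy dependency was probably not resolved.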



On Mon, Aug 1, 2011 at 2:55 PM, Tom Davidson <td...@covario.com> wrote:
> I did something similar to below to add the Cassandra dependencies. Note that I am getting NoSuchMethodErrors not ClassNotFoundExceptions. Can you add the hector jars to your nutch job jar and see what you get? I think I am one step ahead of you. BTW, I just added this line to get the hector dependency:
>
>        <dependency org="me.prettyprint" name="hector-core" rev="0.8.0-2" conf="*->default"/>
>
> -----Original Message-----
> From: Alexis [mailto:alexis.detreglode@gmail.com]
> Sent: Monday, August 01, 2011 2:28 PM
> To: dev@nutch.apache.org
> Subject: Re: Nutch 2 and Cassandra
>
> Hi, libthrift is a dependency of cassandra-thrift, as listed here:
> http://mvnrepository.com/artifact/org.apache.cassandra/cassandra-thrift/0.8.1
>
> During the Nutch build, you have to manually tweak the Ivy configuration depending on your choice of Gora store, in this case Cassandra.
> Basically you need to add all the dependencies listed there:
> http://svn.apache.org/viewvc/incubator/gora/trunk/gora-cassandra/ivy/ivy.xml?view=markup
>
> Let's try to add to $NUTCH_HOME/ivy/ivy.xml the following dependencies and then let's rebuild Nutch (see attached patch):
>        <dependency org="org.apache.gora" name="gora-cassandra" rev="0.2-incubating" conf="*->compile"/>
>        <dependency org="org.apache.cassandra" name="cassandra-thrift" rev="0.8.1"/>
>        <dependency org="com.ecyrd.speed4j" name="speed4j" rev="0.9" conf="*->*,!javadoc,!sources"/>
>        <dependency org="com.github.stephenc.high-scale-lib" name="high-scale-lib" rev="1.1.2" conf="*->*,!javadoc,!sources"/>
>        <dependency org="com.google.collections" name="google-collections" rev="1.0" conf="*->*,!javadoc,!sources"/>
>        <dependency org="com.google.guava" name="guava" rev="r09" conf="*->*,!javadoc,!sources"/>
>
> $ ant clean
> $ ant
>
> In your case libthrift should now be downloaded by Ivy and then bundled into the nutch-2.0-dev.job file. I'm not sure how apache-cassandra and hector got included in your classpath...
>
> Somehow we need to resolve as well:
>        <dependency org="org.apache.cassandra" name="apache-cassandra" rev="0.8.1"/>
>        <dependency org="me.prettyprint" name="hector" rev="0.8.0-1"/>
>
> I don't think these 2 jars are in the default Maven repository, so they won't be downloaded; that's why they were commented out in the Gora Cassandra Ivy config (gora/trunk/gora-cassandra/ivy/ivy.xml)
>
>
> Since hector jar is not found in my case I get:
> ~/java/workspace/Nutch/trunk/runtime/deploy$ bin/nutch inject ~/java/workspace/Nutch/seeds
> 11/08/01 14:18:42 INFO crawl.InjectorJob: InjectorJob: starting
> 11/08/01 14:18:42 INFO crawl.InjectorJob: InjectorJob: urlDir: /home/alex/java/workspace/Nutch/seeds
> 11/08/01 14:18:42 INFO security.Groups: Group mapping impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping; cacheTimeout=300000
> 11/08/01 14:18:42 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
> 11/08/01 14:18:42 ERROR crawl.InjectorJob: InjectorJob: org.apache.gora.util.GoraException: java.lang.reflect.InvocationTargetException
>        at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:110)
>        at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:93)
>        at org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:59)
>        at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:243)
>        at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:268)
>        at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:282)
>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
>        at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:292)
>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:597)
>        at org.apache.hadoop.util.RunJar.main(RunJar.java:192)
> Caused by: java.lang.reflect.InvocationTargetException
>        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
>        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
>        at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>        at org.apache.gora.util.ReflectionUtils.newInstance(ReflectionUtils.java:76)
>        at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:102)
>        ... 12 more
> Caused by: java.lang.NoClassDefFoundError: me/prettyprint/hector/api/Serializer
>        at org.apache.gora.cassandra.store.CassandraStore.<init>(CassandraStore.java:60)
>        ... 18 more
> Caused by: java.lang.ClassNotFoundException: me.prettyprint.hector.api.Serializer
>        at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
>        at java.security.AccessController.doPrivileged(Native Method)
>        at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
>        at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
>        at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
>        ... 19 more
>
>
>
>



--
Lewis




Re: Nutch 2 and Cassandra

Posted by lewis john mcgibbney <le...@gmail.com>.
Hi Tom,

OK I've added the errors you discussed above in the following parts of the
wiki respectively [1] & [2].

It would be great if you could have a look over them and edit them as you
see fit to correct any misinterpretations. In the latter section I have
intentionally not added in your last comments as I was trying to think if
it's possible to add this variable to hadoop-env.sh or something similar.
Can you comment/advise/edit?

Thank you

[1]
http://wiki.apache.org/nutch/ErrorMessagesInNutch2#Nutch_2.0_and_Apache_Cassandra
[2]
http://wiki.apache.org/nutch/ErrorMessagesInNutch2#Missing_plugins_whilst_running_Nutch_2.0_on_Cloudera.27s_CDH3

On Tue, Aug 2, 2011 at 10:36 PM, Tom Davidson <td...@covario.com> wrote:

>  I did run into a couple more problems running Nutch 2 with CDH3. See
> https://issues.apache.org/jira/browse/NUTCH-937. I added a comment on the
> thread explaining my additional problem. I worked around the problem by
> unjarring the nutch-2-dev.job and seeting the HADOOP_CLASSPATH (see below)
> environment variable. Not an ideal solution, but it works.****
>
> ** **
>
> In order to run Nutch 2 on CDH3 I added the following to nutch-site.xml and
> rebuilt the nutch-2-dev.job:****
>
> ** **
>
> ****
>
>     <property>****
>
>         <name>mapreduce.job.jar.unpack.pattern</name>****
>
>         <value>(?:classes/|lib/|plugins/).*</value>****
>
>     </property>****
>
> ** **
>
>     <property>****
>
>         <name>plugin.folders</name>****
>
>         <value>${job.local.dir}/../jars/plugins</value>****
>
>     </property>****
>
> ** **
>
> And I had to set this environment variable to my expanded plugins folder:*
> ***
>
> ** **
>
> export HADOOP_OPTS="-Djob.local.dir=/<MY HOME>/nutch/plugins"****
>
> ** **
>
> ** **
>
> ** **
>
> ** **
>
> ** **
>
> *From:* lewis john mcgibbney [mailto:lewis.mcgibbney@gmail.com]
> *Sent:* Tuesday, August 02, 2011 2:00 PM
>
> *To:* dev@nutch.apache.org
> *Subject:* Re: Nutch 2 and Cassandra****
>
>  ** **
>
> Hi
>
> I've been watching progress on this thread with interest and think that
> this would be a great addition to the wiki under the following page [1]
>
> I am happy to write it up, however is there anything else we need to be
> aware of in addition to the material you have provided, for example some
> latent info that has been assumed or not been explained.
>
> Thank you
>
> [1] http://wiki.apache.org/nutch/ErrorMessagesInNutch2****
>
> On Tue, Aug 2, 2011 at 6:32 PM, Tom Davidson <td...@covario.com>
> wrote:****
>
> I found the problem. I am using Cloudera CDH3 and it has a hue plugins jar
> with an older thrift library in it. I removed the jar from my classpath and
> all is good. Thanks for your help.
>
>
> -----Original Message-----
> From: Tom Davidson [mailto:tdavidson@covario.com]
> Sent: Monday, August 01, 2011 3:29 PM
> To: dev@nutch.apache.org
>
> Subject: RE: Nutch 2 and Cassandra
>
> OK... Are you running with a clustered version of Hadoop? I think you have
> to have your HADOOP_HOME env variable set. Otherwise it runs in local mode.
> I have been able to run in local mode, but not in deployed mode.
>
>
> -----Original Message-----
> From: Alexis [mailto:alexis.detreglode@gmail.com]
> Sent: Monday, August 01, 2011 3:25 PM
> To: dev@nutch.apache.org
> Subject: Re: Nutch 2 and Cassandra
>
> Ok this version of hector was properly resolved. Thanks!
>
> These are the logs:
> ~/java/workspace/Nutch/trunk/runtime/deploy$ bin/nutch inject
> ~/java/workspace/Nutch/seeds
> 11/08/01 15:17:45 INFO crawl.InjectorJob: InjectorJob: starting
> 11/08/01 15:17:45 INFO crawl.InjectorJob: InjectorJob: urlDir:
> /home/alex/java/workspace/Nutch/seeds
> 11/08/01 15:17:45 INFO jvm.JvmMetrics: Initializing JVM Metrics with
> processName=JobTracker, sessionId=
> 11/08/01 15:17:46 INFO connection.CassandraHostRetryService: Downed
> Host Retry service started with queue size -1 and retry delay 10s
> 11/08/01 15:17:46 INFO service.JmxMonitor: Registering JMX
> me.prettyprint.cassandra.service_Test
> Cluster:ServiceType=hector,MonitorType=hector
> 11/08/01 15:17:47 INFO store.CassandraClient: Keyspace 'webpage' in
> cluster 'Test Cluster' was created on host 'localhost'
> 11/08/01 15:17:48 INFO input.FileInputFormat: Total input paths to process
> : 1
> 11/08/01 15:17:49 INFO mapred.JobClient: Running job: job_local_0001
> 11/08/01 15:17:49 INFO input.FileInputFormat: Total input paths to process
> : 1
> 11/08/01 15:17:49 INFO mapreduce.GoraRecordWriter:
> gora.buffer.write.limit = 10000
> 11/08/01 15:17:49 INFO plugin.PluginRepository: Plugins: looking in:
> /tmp/hadoop-alex/hadoop-unjar8045717865743865180/plugins
> 11/08/01 15:17:49 INFO plugin.PluginRepository: Plugin Auto-activation
> mode: [true]
> 11/08/01 15:17:49 INFO plugin.PluginRepository: Registered Plugins:
> 11/08/01 15:17:49 INFO plugin.PluginRepository:         the nutch core
> extension points (nutch-extensionpoints)
> 11/08/01 15:17:49 INFO plugin.PluginRepository:         Basic URL
> Normalizer (urlnormalizer-basic)
> 11/08/01 15:17:49 INFO plugin.PluginRepository:         Basic Indexing
> Filter (index-basic)
> 11/08/01 15:17:49 INFO plugin.PluginRepository:         Html Parse
> Plug-in (parse-html)
> 11/08/01 15:17:49 INFO plugin.PluginRepository:         HTTP Framework
> (lib-http)
> 11/08/01 15:17:49 INFO plugin.PluginRepository:         Pass-through
> URL Normalizer (urlnormalizer-pass)
> 11/08/01 15:17:49 INFO plugin.PluginRepository:         Regex URL
> Filter (urlfilter-regex)
> 11/08/01 15:17:49 INFO plugin.PluginRepository:         Http Protocol
> Plug-in (protocol-http)
> 11/08/01 15:17:49 INFO plugin.PluginRepository:         Regex URL
> Normalizer (urlnormalizer-regex)
> 11/08/01 15:17:49 INFO plugin.PluginRepository:         Tika Parser
> Plug-in (parse-tika)
> 11/08/01 15:17:49 INFO plugin.PluginRepository:         OPIC Scoring
> Plug-in (scoring-opic)
> 11/08/01 15:17:49 INFO plugin.PluginRepository:         CyberNeko HTML
> Parser (lib-nekohtml)
> 11/08/01 15:17:49 INFO plugin.PluginRepository:         Anchor
> Indexing Filter (index-anchor)
> 11/08/01 15:17:49 INFO plugin.PluginRepository:         Regex URL
> Filter Framework (lib-regex-filter)
> 11/08/01 15:17:49 INFO plugin.PluginRepository: Registered
> Extension-Points:
> 11/08/01 15:17:49 INFO plugin.PluginRepository:         Nutch URL
> Normalizer (org.apache.nutch.net.URLNormalizer)
> 11/08/01 15:17:49 INFO plugin.PluginRepository:         Nutch Protocol
> (org.apache.nutch.protocol.Protocol)
> 11/08/01 15:17:49 INFO plugin.PluginRepository:         Parse Filter
> (org.apache.nutch.parse.ParseFilter)
> 11/08/01 15:17:49 INFO plugin.PluginRepository:         Nutch URL
> Filter (org.apache.nutch.net.URLFilter)
> 11/08/01 15:17:49 INFO plugin.PluginRepository:         Nutch Indexing
> Filter (org.apache.nutch.indexer.IndexingFilter)
> 11/08/01 15:17:49 INFO plugin.PluginRepository:         Nutch Content
> Parser (org.apache.nutch.parse.Parser)
> 11/08/01 15:17:49 INFO plugin.PluginRepository:         Nutch Scoring
> (org.apache.nutch.scoring.ScoringFilter)
> 11/08/01 15:17:50 INFO conf.Configuration: found resource
> regex-normalize.xml at
> file:/tmp/hadoop-alex/hadoop-unjar8045717865743865180/regex-normalize.xml
> 11/08/01 15:17:50 INFO conf.Configuration: found resource
> regex-urlfilter.txt at
> file:/tmp/hadoop-alex/hadoop-unjar8045717865743865180/regex-urlfilter.txt
> 11/08/01 15:17:50 INFO regex.RegexURLNormalizer: can't find rules for
> scope 'inject', using default
> 11/08/01 15:17:50 INFO mapred.JobClient:  map 0% reduce 0%
> 11/08/01 15:17:51 INFO mapred.TaskRunner:
> Task:attempt_local_0001_m_000000_0 is done. And is in the process of
> commiting
> 11/08/01 15:17:51 INFO mapred.LocalJobRunner:
> 11/08/01 15:17:51 INFO mapred.TaskRunner: Task
> 'attempt_local_0001_m_000000_0' done.
> 11/08/01 15:17:52 INFO mapred.JobClient:  map 100% reduce 0%
> 11/08/01 15:17:52 INFO mapred.JobClient: Job complete: job_local_0001
> 11/08/01 15:17:52 INFO mapred.JobClient: Counters: 5
> 11/08/01 15:17:52 INFO mapred.JobClient:   FileSystemCounters
> 11/08/01 15:17:52 INFO mapred.JobClient:     FILE_BYTES_READ=44872735
> 11/08/01 15:17:52 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=45245279
> 11/08/01 15:17:52 INFO mapred.JobClient:   Map-Reduce Framework
> 11/08/01 15:17:52 INFO mapred.JobClient:     Map input records=3
> 11/08/01 15:17:52 INFO mapred.JobClient:     Spilled Records=0
> 11/08/01 15:17:52 INFO mapred.JobClient:     Map output records=3
> 11/08/01 15:17:52 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics
> with processName=JobTracker, sessionId= - already initialized
> 11/08/01 15:17:52 INFO crawl.InjectorJob: InjectorJob: finished
>
>
>
> This is what was added to ivy/ivy.xml:
>
> +       <dependency org="org.apache.gora" name="gora-cassandra"
> rev="0.2-incubating" conf="*->compile"/>
> +       <dependency org="org.apache.cassandra" name="cassandra-thrift"
> rev="0.8.1"/>
> +       <dependency org="com.ecyrd.speed4j" name="speed4j" rev="0.9"
> conf="*->*,!javadoc,!sources"/>
> +       <dependency org="com.github.stephenc.high-scale-lib"
> name="high-scale-lib" rev="1.1.2" conf="*->*,!javadoc,!sources"/>
> +       <dependency org="com.google.collections"
> name="google-collections" rev="1.0" conf="*->*,!javadoc,!sources"/>
> +       <dependency org="com.google.guava" name="guava" rev="r09"
> conf="*->*,!javadoc,!sources"/>
> +       <dependency org="org.apache.cassandra" name="apache-cassandra"
> rev="0.8.1"/>
> +       <dependency org="me.prettyprint" name="hector-core" rev="0.8.0-2"/>
>
>
>
> On Mon, Aug 1, 2011 at 2:55 PM, Tom Davidson <td...@covario.com>
> wrote:
> > I did something similar to below to add the Cassandra dependencies. Note
> that I am getting NoSuchMethodErrors not ClassNotFoundExceptions. Can you
> add the hector jars to your nutch job jar and see what you get? I think I am
> one step ahead of you. BTW, I just added this line to get the hector
> dependency:
> >
> >        <dependency org="me.prettyprint" name="hector-core" rev="0.8.0-2"
> conf="*->default"/>
> >
> > -----Original Message-----
> > From: Alexis [mailto:alexis.detreglode@gmail.com]
> > Sent: Monday, August 01, 2011 2:28 PM
> > To: dev@nutch.apache.org
> > Subject: Re: Nutch 2 and Cassandra
> >
> > Hi, libthrift is a dependency of cassandra-thrift, as listed here:
> >
> http://mvnrepository.com/artifact/org.apache.cassandra/cassandra-thrift/0.8.1
> >
> > During Nutch build, you have to manually tweak the Ivy configuration
> depending on your choice of the Gora store, in this case Cassandra.
> > Basically you need to add all the dependencies listed there:
> >
> http://svn.apache.org/viewvc/incubator/gora/trunk/gora-cassandra/ivy/ivy.xml?view=markup
> >
> > Let's try to add to $NUTCH_HOME/ivy/ivy.xml the following dependencies
> and then let's rebuild Nutch (see attached patch):
> >        <dependency org="org.apache.gora" name="gora-cassandra"
> > rev="0.2-incubating" conf="*->compile"/>
> >        <dependency org="org.apache.cassandra" name="cassandra-thrift"
> rev="0.8.1"/>
> >        <dependency org="com.ecyrd.speed4j" name="speed4j" rev="0.9"
> > conf="*->*,!javadoc,!sources"/>
> >        <dependency org="com.github.stephenc.high-scale-lib"
> > name="high-scale-lib" rev="1.1.2" conf="*->*,!javadoc,!sources"/>
> >        <dependency org="com.google.collections" name="google-collections"
> > rev="1.0" conf="*->*,!javadoc,!sources"/>
> >        <dependency org="com.google.guava" name="guava" rev="r09"
> > conf="*->*,!javadoc,!sources"/>
> >
> > $ ant clean
> > $ ant
> >
> > In your case libthrift should now be downloaded by Ivy and then bundled
> into the nutch-2.0-dev.job file. I'm not sure how apache-cassandra and
> hector got included in your classpath...
> >
> > Somehow we need to resolve as well:
> >        <dependency org="org.apache.cassandra" name="apache-cassandra"
> > rev="0.8.1"/>
> >        <dependency org="me.prettyprint" name="hector" rev="0.8.0-1"/>
> >
> > I don't think the following 2 jars are in the default maven repository so
> they won't be downloaded, that's why they were commented in the Gora
> Cassandra Ivy config (gora/trunk/gora-cassandra/ivy/ivy.xml)
> >
> >
> > Since hector jar is not found in my case I get:
> > ~/java/workspace/Nutch/trunk/runtime/deploy$ bin/nutch inject
> ~/java/workspace/Nutch/seeds
> > 11/08/01 14:18:42 INFO crawl.InjectorJob: InjectorJob: starting
> > 11/08/01 14:18:42 INFO crawl.InjectorJob: InjectorJob: urlDir:
> > /home/alex/java/workspace/Nutch/seeds
> > 11/08/01 14:18:42 INFO security.Groups: Group mapping
> impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping;
> > cacheTimeout=300000
> > 11/08/01 14:18:42 INFO jvm.JvmMetrics: Initializing JVM Metrics with
> processName=JobTracker, sessionId=
> > 11/08/01 14:18:42 ERROR crawl.InjectorJob: InjectorJob:
> > org.apache.gora.util.GoraException:
> > java.lang.reflect.InvocationTargetException
> >        at
> org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:110)
> >        at
> org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:93)
> >        at
> org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:59)
> >        at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:243)
> >        at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:268)
> >        at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:282)
> >        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
> >        at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:292)
> >        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >        at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >        at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >        at java.lang.reflect.Method.invoke(Method.java:597)
> >        at org.apache.hadoop.util.RunJar.main(RunJar.java:192)
> > Caused by: java.lang.reflect.InvocationTargetException
> >        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> Method)
> >        at
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
> >        at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
> >        at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
> >        at
> org.apache.gora.util.ReflectionUtils.newInstance(ReflectionUtils.java:76)
> >        at
> org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:102)
> >        ... 12 more
> > Caused by: java.lang.NoClassDefFoundError:
> me/prettyprint/hector/api/Serializer
> >        at
> org.apache.gora.cassandra.store.CassandraStore.<init>(CassandraStore.java:60)
> >        ... 18 more
> > Caused by: java.lang.ClassNotFoundException:
> > me.prettyprint.hector.api.Serializer
> >        at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
> >        at java.security.AccessController.doPrivileged(Native Method)
> >        at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
> >        at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
> >        at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
> >        ... 19 more
> >
> >
> >
> >
> > On Mon, Aug 1, 2011 at 11:59 AM, Tom Davidson <td...@covario.com>
> wrote:
> >> Hi All,
> >>
> >>
> >>
> >> I am kind of at my wit's end here, so I am hoping someone here can
> >> help.  I am trying to use Nutch2 and Cassandra and I have been
> >> successful using the runtime/local build. I am using the Cloudera CDH3
> >> on CentOs 5 and I do not want to contaminate by hadoop install by
> >> dropping in a bunch of Nutch jars, etc. So I am trying to use the
> >> nutch-2-dev.job jar. When I try to use the nutch2-dev.job jar, I get
> >> the error below.  I have double and triple checked the classpath and
> >> the included jars and the only jar that contains FieldValueMetaData is
> >> the libthrift-0.6.1.jar which has the method that is claimed to be
> missing. Any ideas?
> >>
> >>
> >>
> >> Thanks,
> >>
> >> Tom
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >> [tdavidson@nadevsan06 ~]$ bin/nutch inject urls
> >>
> >> /opt/jdk1.6.0_21/bin/java -Dproc_jar -Xmx1000m
> >> -Dhadoop.log.dir=/usr/lib/hadoop-0.20/logs
> >> -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/usr/lib/hadoop-0.20
> >> -Dhadoop.id.str=tdavidson -Dhadoop.root.logger=INFO,console
> >> -Djava.library.path=/usr/lib/hadoop-0.20/lib/native/Linux-amd64-64
> >> -Dhadoop.policy.file=hadoop-policy.xml -classpath
> >> /usr/lib/hadoop-0.20/conf:/opt/jdk1.6.0_21/lib/tools.jar:/usr/lib/hado
> >> op-0.20:/usr/lib/hadoop-0.20/hadoop-core-0.20.2-cdh3u1.jar:/usr/lib/ha
> >> doop-0.20/lib/ant-contrib-1.0b3.jar:/usr/lib/hadoop-0.20/lib/aspectjrt
> >> -1.6.5.jar:/usr/lib/hadoop-0.20/lib/aspectjtools-1.6.5.jar:/usr/lib/ha
> >> doop-0.20/lib/commons-cli-1.2.jar:/usr/lib/hadoop-0.20/lib/commons-cod
> >> ec-1.4.jar:/usr/lib/hadoop-0.20/lib/commons-daemon-1.0.1.jar:/usr/lib/
> >> hadoop-0.20/lib/commons-el-1.0.jar:/usr/lib/hadoop-0.20/lib/commons-ht
> >> tpclient-3.0.1.jar:/usr/lib/hadoop-0.20/lib/commons-logging-1.0.4.jar:
> >> /usr/lib/hadoop-0.20/lib/commons-logging-api-1.0.4.jar:/usr/lib/hadoop
> >> -0.20/lib/commons-net-1.4.1.jar:/usr/lib/hadoop-0.20/lib/core-3.1.1.ja
> >> r:/usr/lib/hadoop-0.20/lib/hadoop-fairscheduler-0.20.2-cdh3u1.jar:/usr
> >> /lib/hadoop-0.20/lib/hsqldb-1.8.0.10.jar:/usr/lib/hadoop-0.20/lib/hue-
> >> plugins-1.2.0-cdh3u1.jar:/usr/lib/hadoop-0.20/lib/jackson-core-asl-1.5
> >> .2.jar:/usr/lib/hadoop-0.20/lib/jackson-mapper-asl-1.5.2.jar:/usr/lib/
> >> hadoop-0.20/lib/jasper-compiler-5.5.12.jar:/usr/lib/hadoop-0.20/lib/ja
> >> sper-runtime-5.5.12.jar:/usr/lib/hadoop-0.20/lib/jets3t-0.6.1.jar:/usr
> >> /lib/hadoop-0.20/lib/jetty-6.1.26.jar:/usr/lib/hadoop-0.20/lib/jetty-s
> >> ervlet-tester-6.1.26.jar:/usr/lib/hadoop-0.20/lib/jetty-util-6.1.26.ja
> >> r:/usr/lib/hadoop-0.20/lib/jsch-0.1.42.jar:/usr/lib/hadoop-0.20/lib/ju
> >> nit-4.5.jar:/usr/lib/hadoop-0.20/lib/kfs-0.2.2.jar:/usr/lib/hadoop-0.2
> >> 0/lib/log4j-1.2.15.jar:/usr/lib/hadoop-0.20/lib/mockito-all-1.8.2.jar:
> >> /usr/lib/hadoop-0.20/lib/oro-2.0.8.jar:/usr/lib/hadoop-0.20/lib/servle
> >> t-api-2.5-20081211.jar:/usr/lib/hadoop-0.20/lib/servlet-api-2.5-6.1.14
> >> .jar:/usr/lib/hadoop-0.20/lib/slf4j-api-1.4.3.jar:/usr/lib/hadoop-0.20
> >> /lib/slf4j-log4j12-1.4.3.jar:/usr/lib/hadoop-0.20/lib/xmlenc-0.52.jar:
> >> /usr/lib/hadoop-0.20/lib/jsp-2.1/jsp-2.1.jar:/usr/lib/hadoop-0.20/lib/
> >> jsp-2.1/jsp-api-2.1.jar org.apache.hadoop.util.RunJar
> >> /home/SEMDIRECTOR/tdavidson/nutch-2.job
> >> org.apache.nutch.crawl.InjectorJob urls
> >>
> >> 11/08/01 11:51:54 INFO crawl.InjectorJob: InjectorJob: starting
> >>
> >> 11/08/01 11:51:54 INFO crawl.InjectorJob: InjectorJob: urlDir: urls
> >>
> >> 11/08/01 11:51:55 INFO connection.CassandraHostRetryService: Downed
> >> Host Retry service started with queue size -1 and retry delay 10s
> >>
> >> 11/08/01 11:51:55 INFO service.JmxMonitor: Registering JMX
> >> me.prettyprint.cassandra.service_Test
> >> Cluster:ServiceType=hector,MonitorType=hector
> >>
> >> 11/08/01 11:51:55 ERROR crawl.InjectorJob: InjectorJob:
> >> org.apache.gora.util.GoraException:
> >> java.lang.reflect.InvocationTargetException
> >>
> >>         at
> >> org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactor
> >> y.java:110)
> >>
> >>         at
> >> org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactor
> >> y.java:93)
> >>
> >>         at
> >> org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java
> >> :59)
> >>
> >>         at
> >> org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:243)
> >>
> >>         at
> >> org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:268)
> >>
> >>         at
> >> org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:282)
> >>
> >>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >>
> >>         at
> >> org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:292)
> >>
> >>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >>
> >>         at
> >> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.j
> >> ava:39)
> >>
> >>         at
> >> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccess
> >> orImpl.java:25)
> >>
> >>         at java.lang.reflect.Method.invoke(Method.java:597)
> >>
> >>         at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
> >>
> >> Caused by: java.lang.reflect.InvocationTargetException
> >>
> >>         at
> >> sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> >> Method)
> >>
> >>         at
> >> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructo
> >> rAccessorImpl.java:39)
> >>
> >>         at
> >> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingCo
> >> nstructorAccessorImpl.java:27)
> >>
> >>         at
> >> java.lang.reflect.Constructor.newInstance(Constructor.java:513)
> >>
> >>         at
> >> org.apache.gora.util.ReflectionUtils.newInstance(ReflectionUtils.java:
> >> 76)
> >>
> >>         at
> >> org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactor
> >> y.java:102)
> >>
> >>         ... 12 more
> >>
> >> Caused by: java.lang.NoSuchMethodError:
> >> org.apache.thrift.meta_data.FieldValueMetaData.<init>(BZ)V
> >>
> >>         at org.apache.cassandra.thrift.CfDef.<clinit>(CfDef.java:299)
> >>
> >>         at org.apache.cassandra.thrift.KsDef.read(KsDef.java:753)
> >>
> >>         at
> >> org.apache.cassandra.thrift.Cassandra$describe_keyspace_result.read(Ca
> >> ssandra.java:24338)
> >>
> >>         at
> >> org.apache.cassandra.thrift.Cassandra$Client.recv_describe_keyspace(Ca
> >> ssandra.java:1371)
> >>
> >>         at
> >> org.apache.cassandra.thrift.Cassandra$Client.describe_keyspace(Cassand
> >> ra.java:1346)
> >>
> >>         at
> >> me.prettyprint.cassandra.service.AbstractCluster$4.execute(AbstractClu
> >> ster.java:192)
> >>
> >>         at
> >> me.prettyprint.cassandra.service.AbstractCluster$4.execute(AbstractClu
> >> ster.java:187)
> >>
> >>         at
> >> me.prettyprint.cassandra.service.Operation.executeAndSetResult(Operati
> >> on.java:101)
> >>
> >>         at
> >> me.prettyprint.cassandra.connection.HConnectionManager.operateWithFail
> >> over(HConnectionManager.java:232)
> >>
> >>         at
> >> me.prettyprint.cassandra.service.AbstractCluster.describeKeyspace(Abst
> >> ractCluster.java:201)
> >>
> >>         at
> >> org.apache.gora.cassandra.store.CassandraClient.checkKeyspace(Cassandr
> >> aClient.java:82)
> >>
> >>         at
> >> org.apache.gora.cassandra.store.CassandraClient.init(CassandraClient.j
> >> ava:69)
> >>
> >>         at
> >> org.apache.gora.cassandra.store.CassandraStore.<init>(CassandraStore.j
> >> ava:68)
> >>
> >>         ... 18 more
> >
>
>
>
>
> --
> *Lewis*
>



-- 
*Lewis*

RE: Nutch 2 and Cassandra

Posted by Tom Davidson <td...@covario.com>.
I did run into a couple more problems running Nutch 2 with CDH3. See https://issues.apache.org/jira/browse/NUTCH-937. I added a comment on the thread explaining my additional problem. I worked around the problem by unjarring the nutch-2-dev.job and setting the HADOOP_CLASSPATH (see below) environment variable. Not an ideal solution, but it works.

In order to run Nutch 2 on CDH3 I added the following to nutch-site.xml and rebuilt the nutch-2-dev.job:

    <property>
        <name>mapreduce.job.jar.unpack.pattern</name>
        <value>(?:classes/|lib/|plugins/).*</value>
    </property>

    <property>
        <name>plugin.folders</name>
        <value>${job.local.dir}/../jars/plugins</value>
    </property>

And I had to set this environment variable to my expanded plugins folder:

export HADOOP_OPTS="-Djob.local.dir=/<MY HOME>/nutch/plugins"
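
To make the first property a bit more concrete: my understanding is that Hadoop tests each entry name in the job jar against this pattern and only unpacks the entries that match, so adding plugins/ is what should get the plugin directories extracted on the task nodes. The sketch below only demonstrates the regex itself, assuming whole-name matching. UnpackPatternDemo and the sample entry names are made up for illustration, and the second pattern is simply the same value without plugins/ for comparison.

```java
import java.util.regex.Pattern;

/** Hypothetical illustration of the unpack pattern; this is not Hadoop code. */
class UnpackPatternDemo {
    // Value copied from the nutch-site.xml snippet above
    static final Pattern WITH_PLUGINS = Pattern.compile("(?:classes/|lib/|plugins/).*");
    // Same value minus the plugins/ alternative, used only for comparison
    static final Pattern WITHOUT_PLUGINS = Pattern.compile("(?:classes/|lib/).*");

    /** True if the whole entry name matches the pattern (the semantics assumed in this sketch). */
    static boolean unpacked(Pattern pattern, String entryName) {
        return pattern.matcher(entryName).matches();
    }

    public static void main(String[] args) {
        // Made-up entry names in the style of a job jar layout
        String[] entries = {
            "lib/libthrift-0.6.1.jar",
            "plugins/parse-html/plugin.xml",
            "classes/Example.class",
            "nutch-default.xml"
        };
        for (String entry : entries) {
            System.out.println(entry
                    + "  with plugins/: " + unpacked(WITH_PLUGINS, entry)
                    + "  without: " + unpacked(WITHOUT_PLUGINS, entry));
        }
    }
}
```

Entries under plugins/ only match once the plugins/ alternative is in the pattern, which would explain why plugin.folders has to point at the unpacked location.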





From: lewis john mcgibbney [mailto:lewis.mcgibbney@gmail.com]
Sent: Tuesday, August 02, 2011 2:00 PM
To: dev@nutch.apache.org
Subject: Re: Nutch 2 and Cassandra

Hi

I've been watching progress on this thread with interest and think that this would be a great addition to the wiki under the following page [1]

I am happy to write it up, however is there anything else we need to be aware of in addition to the material you have provided, for example some latent info that has been assumed or not been explained.

Thank you

[1] http://wiki.apache.org/nutch/ErrorMessagesInNutch2

On Tue, Aug 2, 2011 at 6:32 PM, Tom Davidson <td...@covario.com> wrote:
I found the problem. I am using Cloudera CDH3 and it has a hue plugins jar with an older thrift library in it. I removed the jar from my classpath and all is good. Thanks for your help.
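
For anyone hitting the same NoSuchMethodError: a conflict like this can be checked from inside the JVM by asking where the class was actually loaded from and whether it has the expected constructor. Below is a minimal sketch, assuming it is run with the same classpath as the failing job. ClassOriginProbe and its method names are made up for illustration and are not part of Nutch, Gora, or Hadoop. The (byte, boolean) signature is my reading of the (BZ)V descriptor in the error.

```java
import java.io.File;
import java.io.IOException;
import java.security.CodeSource;
import java.util.ArrayList;
import java.util.List;
import java.util.jar.JarFile;

/** Hypothetical diagnostic helper; not part of Nutch, Gora, or Hadoop. */
class ClassOriginProbe {

    /** Returns the jar or directory a class was loaded from, or "bootstrap" if the JVM reports none. */
    static String originOf(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        return (source == null || source.getLocation() == null)
                ? "bootstrap" : source.getLocation().toString();
    }

    /** True if the loaded class declares a constructor with exactly these parameter types. */
    static boolean hasConstructor(Class<?> type, Class<?>... parameterTypes) {
        try {
            type.getDeclaredConstructor(parameterTypes);
            return true;
        } catch (NoSuchMethodException e) {
            return false;
        }
    }

    /** Lists every readable jar on the given classpath string that contains the named class. */
    static List<String> jarsContaining(String classpath, String className) {
        String entryName = className.replace('.', '/') + ".class";
        List<String> hits = new ArrayList<String>();
        for (String element : classpath.split(File.pathSeparator)) {
            File file = new File(element);
            if (!file.isFile() || !element.endsWith(".jar")) {
                continue; // directories and missing entries are skipped in this sketch
            }
            try {
                JarFile jar = new JarFile(file);
                try {
                    if (jar.getEntry(entryName) != null) {
                        hits.add(element);
                    }
                } finally {
                    jar.close();
                }
            } catch (IOException e) {
                // unreadable or corrupt jar: ignore it and keep scanning
            }
        }
        return hits;
    }

    public static void main(String[] args) throws Exception {
        String name = args.length > 0 ? args[0] : "org.apache.thrift.meta_data.FieldValueMetaData";
        Class<?> type = Class.forName(name);
        System.out.println("loaded from: " + originOf(type));
        // (BZ)V is the JVM descriptor for a constructor taking (byte, boolean)
        System.out.println("has (byte, boolean) constructor: "
                + hasConstructor(type, byte.class, boolean.class));
        System.out.println("jars containing it: "
                + jarsContaining(System.getProperty("java.class.path"), name));
    }
}
```

If the reported origin is something other than libthrift-0.6.1.jar, or more than one jar is listed, an older copy is probably shadowing the one bundled in the job jar. Note that the scan only covers the JVM classpath, not the jars nested inside the job file.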

-----Original Message-----
From: Tom Davidson [mailto:tdavidson@covario.com]
Sent: Monday, August 01, 2011 3:29 PM
To: dev@nutch.apache.org
Subject: RE: Nutch 2 and Cassandra

OK... Are you running with a clustered version of Hadoop? I think you have to have your HADOOP_HOME env variable set. Otherwise it runs in local mode. I have been able to run in local mode, but not in deployed mode.


-----Original Message-----
From: Alexis [mailto:alexis.detreglode@gmail.com]
Sent: Monday, August 01, 2011 3:25 PM
To: dev@nutch.apache.org
Subject: Re: Nutch 2 and Cassandra

Ok this version of hector was properly resolved. Thanks!

These are the logs:
~/java/workspace/Nutch/trunk/runtime/deploy$ bin/nutch inject ~/java/workspace/Nutch/seeds
11/08/01 15:17:45 INFO crawl.InjectorJob: InjectorJob: starting
11/08/01 15:17:45 INFO crawl.InjectorJob: InjectorJob: urlDir: /home/alex/java/workspace/Nutch/seeds
11/08/01 15:17:45 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
11/08/01 15:17:46 INFO connection.CassandraHostRetryService: Downed Host Retry service started with queue size -1 and retry delay 10s
11/08/01 15:17:46 INFO service.JmxMonitor: Registering JMX me.prettyprint.cassandra.service_Test Cluster:ServiceType=hector,MonitorType=hector
11/08/01 15:17:47 INFO store.CassandraClient: Keyspace 'webpage' in cluster 'Test Cluster' was created on host 'localhost'
11/08/01 15:17:48 INFO input.FileInputFormat: Total input paths to process : 1
11/08/01 15:17:49 INFO mapred.JobClient: Running job: job_local_0001
11/08/01 15:17:49 INFO input.FileInputFormat: Total input paths to process : 1
11/08/01 15:17:49 INFO mapreduce.GoraRecordWriter: gora.buffer.write.limit = 10000
11/08/01 15:17:49 INFO plugin.PluginRepository: Plugins: looking in: /tmp/hadoop-alex/hadoop-unjar8045717865743865180/plugins
11/08/01 15:17:49 INFO plugin.PluginRepository: Plugin Auto-activation mode: [true]
11/08/01 15:17:49 INFO plugin.PluginRepository: Registered Plugins:
11/08/01 15:17:49 INFO plugin.PluginRepository:         the nutch core extension points (nutch-extensionpoints)
11/08/01 15:17:49 INFO plugin.PluginRepository:         Basic URL Normalizer (urlnormalizer-basic)
11/08/01 15:17:49 INFO plugin.PluginRepository:         Basic Indexing Filter (index-basic)
11/08/01 15:17:49 INFO plugin.PluginRepository:         Html Parse Plug-in (parse-html)
11/08/01 15:17:49 INFO plugin.PluginRepository:         HTTP Framework (lib-http)
11/08/01 15:17:49 INFO plugin.PluginRepository:         Pass-through URL Normalizer (urlnormalizer-pass)
11/08/01 15:17:49 INFO plugin.PluginRepository:         Regex URL Filter (urlfilter-regex)
11/08/01 15:17:49 INFO plugin.PluginRepository:         Http Protocol Plug-in (protocol-http)
11/08/01 15:17:49 INFO plugin.PluginRepository:         Regex URL Normalizer (urlnormalizer-regex)
11/08/01 15:17:49 INFO plugin.PluginRepository:         Tika Parser Plug-in (parse-tika)
11/08/01 15:17:49 INFO plugin.PluginRepository:         OPIC Scoring Plug-in (scoring-opic)
11/08/01 15:17:49 INFO plugin.PluginRepository:         CyberNeko HTML Parser (lib-nekohtml)
11/08/01 15:17:49 INFO plugin.PluginRepository:         Anchor Indexing Filter (index-anchor)
11/08/01 15:17:49 INFO plugin.PluginRepository:         Regex URL Filter Framework (lib-regex-filter)
11/08/01 15:17:49 INFO plugin.PluginRepository: Registered Extension-Points:
11/08/01 15:17:49 INFO plugin.PluginRepository:         Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
11/08/01 15:17:49 INFO plugin.PluginRepository:         Nutch Protocol (org.apache.nutch.protocol.Protocol)
11/08/01 15:17:49 INFO plugin.PluginRepository:         Parse Filter (org.apache.nutch.parse.ParseFilter)
11/08/01 15:17:49 INFO plugin.PluginRepository:         Nutch URL Filter (org.apache.nutch.net.URLFilter)
11/08/01 15:17:49 INFO plugin.PluginRepository:         Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
11/08/01 15:17:49 INFO plugin.PluginRepository:         Nutch Content Parser (org.apache.nutch.parse.Parser)
11/08/01 15:17:49 INFO plugin.PluginRepository:         Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
11/08/01 15:17:50 INFO conf.Configuration: found resource regex-normalize.xml at file:/tmp/hadoop-alex/hadoop-unjar8045717865743865180/regex-normalize.xml
11/08/01 15:17:50 INFO conf.Configuration: found resource regex-urlfilter.txt at file:/tmp/hadoop-alex/hadoop-unjar8045717865743865180/regex-urlfilter.txt
11/08/01 15:17:50 INFO regex.RegexURLNormalizer: can't find rules for scope 'inject', using default
11/08/01 15:17:50 INFO mapred.JobClient:  map 0% reduce 0%
11/08/01 15:17:51 INFO mapred.TaskRunner: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
11/08/01 15:17:51 INFO mapred.LocalJobRunner:
11/08/01 15:17:51 INFO mapred.TaskRunner: Task 'attempt_local_0001_m_000000_0' done.
11/08/01 15:17:52 INFO mapred.JobClient:  map 100% reduce 0%
11/08/01 15:17:52 INFO mapred.JobClient: Job complete: job_local_0001
11/08/01 15:17:52 INFO mapred.JobClient: Counters: 5
11/08/01 15:17:52 INFO mapred.JobClient:   FileSystemCounters
11/08/01 15:17:52 INFO mapred.JobClient:     FILE_BYTES_READ=44872735
11/08/01 15:17:52 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=45245279
11/08/01 15:17:52 INFO mapred.JobClient:   Map-Reduce Framework
11/08/01 15:17:52 INFO mapred.JobClient:     Map input records=3
11/08/01 15:17:52 INFO mapred.JobClient:     Spilled Records=0
11/08/01 15:17:52 INFO mapred.JobClient:     Map output records=3
11/08/01 15:17:52 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
11/08/01 15:17:52 INFO crawl.InjectorJob: InjectorJob: finished



This is what was added to ivy/ivy.xml:

+       <dependency org="org.apache.gora" name="gora-cassandra" rev="0.2-incubating" conf="*->compile"/>
+       <dependency org="org.apache.cassandra" name="cassandra-thrift" rev="0.8.1"/>
+       <dependency org="com.ecyrd.speed4j" name="speed4j" rev="0.9" conf="*->*,!javadoc,!sources"/>
+       <dependency org="com.github.stephenc.high-scale-lib" name="high-scale-lib" rev="1.1.2" conf="*->*,!javadoc,!sources"/>
+       <dependency org="com.google.collections" name="google-collections" rev="1.0" conf="*->*,!javadoc,!sources"/>
+       <dependency org="com.google.guava" name="guava" rev="r09" conf="*->*,!javadoc,!sources"/>
+       <dependency org="org.apache.cassandra" name="apache-cassandra" rev="0.8.1"/>
+       <dependency org="me.prettyprint" name="hector-core" rev="0.8.0-2"/>



On Mon, Aug 1, 2011 at 2:55 PM, Tom Davidson <td...@covario.com> wrote:
> I did something similar to below to add the Cassandra dependencies. Note that I am getting NoSuchMethodErrors not ClassNotFoundExceptions. Can you add the hector jars to your nutch job jar and see what you get? I think I am one step ahead of you. BTW, I just added this line to get the hector dependency:
>
>        <dependency org="me.prettyprint" name="hector-core" rev="0.8.0-2" conf="*->default"/>
>
> -----Original Message-----
> From: Alexis [mailto:alexis.detreglode@gmail.com]
> Sent: Monday, August 01, 2011 2:28 PM
> To: dev@nutch.apache.org
> Subject: Re: Nutch 2 and Cassandra
>
> Hi, libthrift is a dependency of cassandra-thrift, as listed here:
> http://mvnrepository.com/artifact/org.apache.cassandra/cassandra-thrift/0.8.1
>
> During Nutch build, you have to manually tweak the Ivy configuration depending on your choice of the Gora store, in this case Cassandra.
> Basically you need to add all the dependencies listed there:
> http://svn.apache.org/viewvc/incubator/gora/trunk/gora-cassandra/ivy/ivy.xml?view=markup
>
> Let's try to add to $NUTCH_HOME/ivy/ivy.xml the following dependencies and then let's rebuild Nutch (see attached patch):
>        <dependency org="org.apache.gora" name="gora-cassandra" rev="0.2-incubating" conf="*->compile"/>
>        <dependency org="org.apache.cassandra" name="cassandra-thrift" rev="0.8.1"/>
>        <dependency org="com.ecyrd.speed4j" name="speed4j" rev="0.9" conf="*->*,!javadoc,!sources"/>
>        <dependency org="com.github.stephenc.high-scale-lib" name="high-scale-lib" rev="1.1.2" conf="*->*,!javadoc,!sources"/>
>        <dependency org="com.google.collections" name="google-collections" rev="1.0" conf="*->*,!javadoc,!sources"/>
>        <dependency org="com.google.guava" name="guava" rev="r09" conf="*->*,!javadoc,!sources"/>
>
> $ ant clean
> $ ant
>
> In your case libthrift should now be downloaded by Ivy and then bundled into the nutch-2.0-dev.job file. I'm not sure how apache-cassandra and hector got included in your classpath...
>
> Somehow we need to resolve as well:
>        <dependency org="org.apache.cassandra" name="apache-cassandra" rev="0.8.1"/>
>        <dependency org="me.prettyprint" name="hector" rev="0.8.0-1"/>
>
> I don't think these 2 jars are in the default Maven repository, so they won't be downloaded. That's why they were commented out in the Gora Cassandra Ivy config (gora/trunk/gora-cassandra/ivy/ivy.xml).
>
>
> Since hector jar is not found in my case I get:
> ~/java/workspace/Nutch/trunk/runtime/deploy$ bin/nutch inject ~/java/workspace/Nutch/seeds
> 11/08/01 14:18:42 INFO crawl.InjectorJob: InjectorJob: starting
> 11/08/01 14:18:42 INFO crawl.InjectorJob: InjectorJob: urlDir: /home/alex/java/workspace/Nutch/seeds
> 11/08/01 14:18:42 INFO security.Groups: Group mapping impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping; cacheTimeout=300000
> 11/08/01 14:18:42 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
> 11/08/01 14:18:42 ERROR crawl.InjectorJob: InjectorJob: org.apache.gora.util.GoraException: java.lang.reflect.InvocationTargetException
>        at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:110)
>        at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:93)
>        at org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:59)
>        at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:243)
>        at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:268)
>        at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:282)
>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
>        at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:292)
>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:597)
>        at org.apache.hadoop.util.RunJar.main(RunJar.java:192)
> Caused by: java.lang.reflect.InvocationTargetException
>        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
>        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
>        at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>        at org.apache.gora.util.ReflectionUtils.newInstance(ReflectionUtils.java:76)
>        at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:102)
>        ... 12 more
> Caused by: java.lang.NoClassDefFoundError: me/prettyprint/hector/api/Serializer
>        at org.apache.gora.cassandra.store.CassandraStore.<init>(CassandraStore.java:60)
>        ... 18 more
> Caused by: java.lang.ClassNotFoundException:
> me.prettyprint.hector.api.Serializer
>        at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
>        at java.security.AccessController.doPrivileged(Native Method)
>        at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
>        at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
>        at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
>        ... 19 more
>
>
>
>
> On Mon, Aug 1, 2011 at 11:59 AM, Tom Davidson <td...@covario.com> wrote:
>> Hi All,
>>
>>
>>
>> I am kind of at my wit's end here, so I am hoping someone here can
>> help.  I am trying to use Nutch2 and Cassandra and I have been
>> successful using the runtime/local build. I am using the Cloudera CDH3
>> on CentOS 5 and I do not want to contaminate my hadoop install by
>> dropping in a bunch of Nutch jars, etc. So I am trying to use the
>> nutch-2-dev.job jar. When I try to use the nutch2-dev.job jar, I get
>> the error below.  I have double and triple checked the classpath and
>> the included jars and the only jar that contains FieldValueMetaData is
>> the libthrift-0.6.1.jar which has the method that is claimed to be missing. Any ideas?
>>
>>
>>
>> Thanks,
>>
>> Tom
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> [tdavidson@nadevsan06 ~]$ bin/nutch inject urls
>>
>> /opt/jdk1.6.0_21/bin/java -Dproc_jar -Xmx1000m
>> -Dhadoop.log.dir=/usr/lib/hadoop-0.20/logs
>> -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/usr/lib/hadoop-0.20
>> -Dhadoop.id.str=tdavidson -Dhadoop.root.logger=INFO,console
>> -Djava.library.path=/usr/lib/hadoop-0.20/lib/native/Linux-amd64-64
>> -Dhadoop.policy.file=hadoop-policy.xml -classpath
>> /usr/lib/hadoop-0.20/conf:/opt/jdk1.6.0_21/lib/tools.jar:/usr/lib/hado
>> op-0.20:/usr/lib/hadoop-0.20/hadoop-core-0.20.2-cdh3u1.jar:/usr/lib/ha
>> doop-0.20/lib/ant-contrib-1.0b3.jar:/usr/lib/hadoop-0.20/lib/aspectjrt
>> -1.6.5.jar:/usr/lib/hadoop-0.20/lib/aspectjtools-1.6.5.jar:/usr/lib/ha
>> doop-0.20/lib/commons-cli-1.2.jar:/usr/lib/hadoop-0.20/lib/commons-cod
>> ec-1.4.jar:/usr/lib/hadoop-0.20/lib/commons-daemon-1.0.1.jar:/usr/lib/
>> hadoop-0.20/lib/commons-el-1.0.jar:/usr/lib/hadoop-0.20/lib/commons-ht
>> tpclient-3.0.1.jar:/usr/lib/hadoop-0.20/lib/commons-logging-1.0.4.jar:
>> /usr/lib/hadoop-0.20/lib/commons-logging-api-1.0.4.jar:/usr/lib/hadoop
>> -0.20/lib/commons-net-1.4.1.jar:/usr/lib/hadoop-0.20/lib/core-3.1.1.ja
>> r:/usr/lib/hadoop-0.20/lib/hadoop-fairscheduler-0.20.2-cdh3u1.jar:/usr
>> /lib/hadoop-0.20/lib/hsqldb-1.8.0.10.jar:/usr/lib/hadoop-0.20/lib/hue-
>> plugins-1.2.0-cdh3u1.jar:/usr/lib/hadoop-0.20/lib/jackson-core-asl-1.5
>> .2.jar:/usr/lib/hadoop-0.20/lib/jackson-mapper-asl-1.5.2.jar:/usr/lib/
>> hadoop-0.20/lib/jasper-compiler-5.5.12.jar:/usr/lib/hadoop-0.20/lib/ja
>> sper-runtime-5.5.12.jar:/usr/lib/hadoop-0.20/lib/jets3t-0.6.1.jar:/usr
>> /lib/hadoop-0.20/lib/jetty-6.1.26.jar:/usr/lib/hadoop-0.20/lib/jetty-s
>> ervlet-tester-6.1.26.jar:/usr/lib/hadoop-0.20/lib/jetty-util-6.1.26.ja
>> r:/usr/lib/hadoop-0.20/lib/jsch-0.1.42.jar:/usr/lib/hadoop-0.20/lib/ju
>> nit-4.5.jar:/usr/lib/hadoop-0.20/lib/kfs-0.2.2.jar:/usr/lib/hadoop-0.2
>> 0/lib/log4j-1.2.15.jar:/usr/lib/hadoop-0.20/lib/mockito-all-1.8.2.jar:
>> /usr/lib/hadoop-0.20/lib/oro-2.0.8.jar:/usr/lib/hadoop-0.20/lib/servle
>> t-api-2.5-20081211.jar:/usr/lib/hadoop-0.20/lib/servlet-api-2.5-6.1.14
>> .jar:/usr/lib/hadoop-0.20/lib/slf4j-api-1.4.3.jar:/usr/lib/hadoop-0.20
>> /lib/slf4j-log4j12-1.4.3.jar:/usr/lib/hadoop-0.20/lib/xmlenc-0.52.jar:
>> /usr/lib/hadoop-0.20/lib/jsp-2.1/jsp-2.1.jar:/usr/lib/hadoop-0.20/lib/
>> jsp-2.1/jsp-api-2.1.jar org.apache.hadoop.util.RunJar
>> /home/SEMDIRECTOR/tdavidson/nutch-2.job
>> org.apache.nutch.crawl.InjectorJob urls
>>
>> 11/08/01 11:51:54 INFO crawl.InjectorJob: InjectorJob: starting
>>
>> 11/08/01 11:51:54 INFO crawl.InjectorJob: InjectorJob: urlDir: urls
>>
>> 11/08/01 11:51:55 INFO connection.CassandraHostRetryService: Downed
>> Host Retry service started with queue size -1 and retry delay 10s
>>
>> 11/08/01 11:51:55 INFO service.JmxMonitor: Registering JMX
>> me.prettyprint.cassandra.service_Test
>> Cluster:ServiceType=hector,MonitorType=hector
>>
>> 11/08/01 11:51:55 ERROR crawl.InjectorJob: InjectorJob:
>> org.apache.gora.util.GoraException:
>> java.lang.reflect.InvocationTargetException
>>
>>         at
>> org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactor
>> y.java:110)
>>
>>         at
>> org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactor
>> y.java:93)
>>
>>         at
>> org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java
>> :59)
>>
>>         at
>> org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:243)
>>
>>         at
>> org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:268)
>>
>>         at
>> org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:282)
>>
>>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>
>>         at
>> org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:292)
>>
>>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>
>>         at
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.j
>> ava:39)
>>
>>         at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccess
>> orImpl.java:25)
>>
>>         at java.lang.reflect.Method.invoke(Method.java:597)
>>
>>         at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
>>
>> Caused by: java.lang.reflect.InvocationTargetException
>>
>>         at
>> sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
>> Method)
>>
>>         at
>> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructo
>> rAccessorImpl.java:39)
>>
>>         at
>> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingCo
>> nstructorAccessorImpl.java:27)
>>
>>         at
>> java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>>
>>         at
>> org.apache.gora.util.ReflectionUtils.newInstance(ReflectionUtils.java:
>> 76)
>>
>>         at
>> org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactor
>> y.java:102)
>>
>>         ... 12 more
>>
>> Caused by: java.lang.NoSuchMethodError:
>> org.apache.thrift.meta_data.FieldValueMetaData.<init>(BZ)V
>>
>>         at org.apache.cassandra.thrift.CfDef.<clinit>(CfDef.java:299)
>>
>>         at org.apache.cassandra.thrift.KsDef.read(KsDef.java:753)
>>
>>         at
>> org.apache.cassandra.thrift.Cassandra$describe_keyspace_result.read(Ca
>> ssandra.java:24338)
>>
>>         at
>> org.apache.cassandra.thrift.Cassandra$Client.recv_describe_keyspace(Ca
>> ssandra.java:1371)
>>
>>         at
>> org.apache.cassandra.thrift.Cassandra$Client.describe_keyspace(Cassand
>> ra.java:1346)
>>
>>         at
>> me.prettyprint.cassandra.service.AbstractCluster$4.execute(AbstractClu
>> ster.java:192)
>>
>>         at
>> me.prettyprint.cassandra.service.AbstractCluster$4.execute(AbstractClu
>> ster.java:187)
>>
>>         at
>> me.prettyprint.cassandra.service.Operation.executeAndSetResult(Operati
>> on.java:101)
>>
>>         at
>> me.prettyprint.cassandra.connection.HConnectionManager.operateWithFail
>> over(HConnectionManager.java:232)
>>
>>         at
>> me.prettyprint.cassandra.service.AbstractCluster.describeKeyspace(Abst
>> ractCluster.java:201)
>>
>>         at
>> org.apache.gora.cassandra.store.CassandraClient.checkKeyspace(Cassandr
>> aClient.java:82)
>>
>>         at
>> org.apache.gora.cassandra.store.CassandraClient.init(CassandraClient.j
>> ava:69)
>>
>>         at
>> org.apache.gora.cassandra.store.CassandraStore.<init>(CassandraStore.j
>> ava:68)
>>
>>         ... 18 more
>



--
Lewis

Re: Nutch 2 and Cassandra

Posted by lewis john mcgibbney <le...@gmail.com>.
Hi

I've been watching progress on this thread with interest and think that this
would be a great addition to the wiki under the following page [1].

I am happy to write it up. However, is there anything else we need to be
aware of in addition to the material you have provided, for example some
latent info that has been assumed or not explained?

Thank you

[1] http://wiki.apache.org/nutch/ErrorMessagesInNutch2

On Tue, Aug 2, 2011 at 6:32 PM, Tom Davidson <td...@covario.com> wrote:

> I found the problem. I am using Cloudera CDH3 and it has a hue plugins jar
> with an older thrift library in it. I removed the jar from my classpath and
> all is good. Thanks for your help.
>
> -----Original Message-----
> From: Tom Davidson [mailto:tdavidson@covario.com]
> Sent: Monday, August 01, 2011 3:29 PM
> To: dev@nutch.apache.org
> Subject: RE: Nutch 2 and Cassandra
>
> OK... Are you running with a clustered version of Hadoop? I think you have
> to have your HADOOP_HOME env variable set. Otherwise it runs in local mode.
> I have been able to run in local mode, but not in deployed mode.
>
>
> -----Original Message-----
> From: Alexis [mailto:alexis.detreglode@gmail.com]
> Sent: Monday, August 01, 2011 3:25 PM
> To: dev@nutch.apache.org
> Subject: Re: Nutch 2 and Cassandra
>
> Ok this version of hector was properly resolved. Thanks!
>
> These are the logs:
> ~/java/workspace/Nutch/trunk/runtime/deploy$ bin/nutch inject
> ~/java/workspace/Nutch/seeds
> 11/08/01 15:17:45 INFO crawl.InjectorJob: InjectorJob: starting
> 11/08/01 15:17:45 INFO crawl.InjectorJob: InjectorJob: urlDir:
> /home/alex/java/workspace/Nutch/seeds
> 11/08/01 15:17:45 INFO jvm.JvmMetrics: Initializing JVM Metrics with
> processName=JobTracker, sessionId=
> 11/08/01 15:17:46 INFO connection.CassandraHostRetryService: Downed
> Host Retry service started with queue size -1 and retry delay 10s
> 11/08/01 15:17:46 INFO service.JmxMonitor: Registering JMX
> me.prettyprint.cassandra.service_Test
> Cluster:ServiceType=hector,MonitorType=hector
> 11/08/01 15:17:47 INFO store.CassandraClient: Keyspace 'webpage' in
> cluster 'Test Cluster' was created on host 'localhost'
> 11/08/01 15:17:48 INFO input.FileInputFormat: Total input paths to process
> : 1
> 11/08/01 15:17:49 INFO mapred.JobClient: Running job: job_local_0001
> 11/08/01 15:17:49 INFO input.FileInputFormat: Total input paths to process
> : 1
> 11/08/01 15:17:49 INFO mapreduce.GoraRecordWriter:
> gora.buffer.write.limit = 10000
> 11/08/01 15:17:49 INFO plugin.PluginRepository: Plugins: looking in:
> /tmp/hadoop-alex/hadoop-unjar8045717865743865180/plugins
> 11/08/01 15:17:49 INFO plugin.PluginRepository: Plugin Auto-activation
> mode: [true]
> 11/08/01 15:17:49 INFO plugin.PluginRepository: Registered Plugins:
> 11/08/01 15:17:49 INFO plugin.PluginRepository:         the nutch core
> extension points (nutch-extensionpoints)
> 11/08/01 15:17:49 INFO plugin.PluginRepository:         Basic URL
> Normalizer (urlnormalizer-basic)
> 11/08/01 15:17:49 INFO plugin.PluginRepository:         Basic Indexing
> Filter (index-basic)
> 11/08/01 15:17:49 INFO plugin.PluginRepository:         Html Parse
> Plug-in (parse-html)
> 11/08/01 15:17:49 INFO plugin.PluginRepository:         HTTP Framework
> (lib-http)
> 11/08/01 15:17:49 INFO plugin.PluginRepository:         Pass-through
> URL Normalizer (urlnormalizer-pass)
> 11/08/01 15:17:49 INFO plugin.PluginRepository:         Regex URL
> Filter (urlfilter-regex)
> 11/08/01 15:17:49 INFO plugin.PluginRepository:         Http Protocol
> Plug-in (protocol-http)
> 11/08/01 15:17:49 INFO plugin.PluginRepository:         Regex URL
> Normalizer (urlnormalizer-regex)
> 11/08/01 15:17:49 INFO plugin.PluginRepository:         Tika Parser
> Plug-in (parse-tika)
> 11/08/01 15:17:49 INFO plugin.PluginRepository:         OPIC Scoring
> Plug-in (scoring-opic)
> 11/08/01 15:17:49 INFO plugin.PluginRepository:         CyberNeko HTML
> Parser (lib-nekohtml)
> 11/08/01 15:17:49 INFO plugin.PluginRepository:         Anchor
> Indexing Filter (index-anchor)
> 11/08/01 15:17:49 INFO plugin.PluginRepository:         Regex URL
> Filter Framework (lib-regex-filter)
> 11/08/01 15:17:49 INFO plugin.PluginRepository: Registered
> Extension-Points:
> 11/08/01 15:17:49 INFO plugin.PluginRepository:         Nutch URL
> Normalizer (org.apache.nutch.net.URLNormalizer)
> 11/08/01 15:17:49 INFO plugin.PluginRepository:         Nutch Protocol
> (org.apache.nutch.protocol.Protocol)
> 11/08/01 15:17:49 INFO plugin.PluginRepository:         Parse Filter
> (org.apache.nutch.parse.ParseFilter)
> 11/08/01 15:17:49 INFO plugin.PluginRepository:         Nutch URL
> Filter (org.apache.nutch.net.URLFilter)
> 11/08/01 15:17:49 INFO plugin.PluginRepository:         Nutch Indexing
> Filter (org.apache.nutch.indexer.IndexingFilter)
> 11/08/01 15:17:49 INFO plugin.PluginRepository:         Nutch Content
> Parser (org.apache.nutch.parse.Parser)
> 11/08/01 15:17:49 INFO plugin.PluginRepository:         Nutch Scoring
> (org.apache.nutch.scoring.ScoringFilter)
> 11/08/01 15:17:50 INFO conf.Configuration: found resource
> regex-normalize.xml at
> file:/tmp/hadoop-alex/hadoop-unjar8045717865743865180/regex-normalize.xml
> 11/08/01 15:17:50 INFO conf.Configuration: found resource
> regex-urlfilter.txt at
> file:/tmp/hadoop-alex/hadoop-unjar8045717865743865180/regex-urlfilter.txt
> 11/08/01 15:17:50 INFO regex.RegexURLNormalizer: can't find rules for
> scope 'inject', using default
> 11/08/01 15:17:50 INFO mapred.JobClient:  map 0% reduce 0%
> 11/08/01 15:17:51 INFO mapred.TaskRunner:
> Task:attempt_local_0001_m_000000_0 is done. And is in the process of
> commiting
> 11/08/01 15:17:51 INFO mapred.LocalJobRunner:
> 11/08/01 15:17:51 INFO mapred.TaskRunner: Task
> 'attempt_local_0001_m_000000_0' done.
> 11/08/01 15:17:52 INFO mapred.JobClient:  map 100% reduce 0%
> 11/08/01 15:17:52 INFO mapred.JobClient: Job complete: job_local_0001
> 11/08/01 15:17:52 INFO mapred.JobClient: Counters: 5
> 11/08/01 15:17:52 INFO mapred.JobClient:   FileSystemCounters
> 11/08/01 15:17:52 INFO mapred.JobClient:     FILE_BYTES_READ=44872735
> 11/08/01 15:17:52 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=45245279
> 11/08/01 15:17:52 INFO mapred.JobClient:   Map-Reduce Framework
> 11/08/01 15:17:52 INFO mapred.JobClient:     Map input records=3
> 11/08/01 15:17:52 INFO mapred.JobClient:     Spilled Records=0
> 11/08/01 15:17:52 INFO mapred.JobClient:     Map output records=3
> 11/08/01 15:17:52 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics
> with processName=JobTracker, sessionId= - already initialized
> 11/08/01 15:17:52 INFO crawl.InjectorJob: InjectorJob: finished
>
>
>
> This is what was added to ivy/ivy.xml:
>
> +       <dependency org="org.apache.gora" name="gora-cassandra"
> rev="0.2-incubating" conf="*->compile"/>
> +       <dependency org="org.apache.cassandra" name="cassandra-thrift"
> rev="0.8.1"/>
> +       <dependency org="com.ecyrd.speed4j" name="speed4j" rev="0.9"
> conf="*->*,!javadoc,!sources"/>
> +       <dependency org="com.github.stephenc.high-scale-lib"
> name="high-scale-lib" rev="1.1.2" conf="*->*,!javadoc,!sources"/>
> +       <dependency org="com.google.collections"
> name="google-collections" rev="1.0" conf="*->*,!javadoc,!sources"/>
> +       <dependency org="com.google.guava" name="guava" rev="r09"
> conf="*->*,!javadoc,!sources"/>
> +       <dependency org="org.apache.cassandra" name="apache-cassandra"
> rev="0.8.1"/>
> +       <dependency org="me.prettyprint" name="hector-core" rev="0.8.0-2"/>
>
>
>
> On Mon, Aug 1, 2011 at 2:55 PM, Tom Davidson <td...@covario.com> wrote:
> > I did something similar to below to add the Cassandra dependencies. Note
> > that I am getting NoSuchMethodErrors not ClassNotFoundExceptions. Can you
> > add the hector jars to your nutch job jar and see what you get? I think I am
> > one step ahead of you. BTW, I just added this line to get the hector
> > dependency:
> >
> >        <dependency org="me.prettyprint" name="hector-core" rev="0.8.0-2" conf="*->default"/>
> >
> > -----Original Message-----
> > From: Alexis [mailto:alexis.detreglode@gmail.com]
> > Sent: Monday, August 01, 2011 2:28 PM
> > To: dev@nutch.apache.org
> > Subject: Re: Nutch 2 and Cassandra
> >
> > Hi, libthrift is a dependency of cassandra-thrift, as listed here:
> > http://mvnrepository.com/artifact/org.apache.cassandra/cassandra-thrift/0.8.1
> >
> > During Nutch build, you have to manually tweak the Ivy configuration
> > depending on your choice of the Gora store, in this case Cassandra.
> > Basically you need to add all the dependencies listed there:
> > http://svn.apache.org/viewvc/incubator/gora/trunk/gora-cassandra/ivy/ivy.xml?view=markup
> >
> > Let's try to add to $NUTCH_HOME/ivy/ivy.xml the following dependencies
> > and then let's rebuild Nutch (see attached patch):
> >        <dependency org="org.apache.gora" name="gora-cassandra"
> > rev="0.2-incubating" conf="*->compile"/>
> >        <dependency org="org.apache.cassandra" name="cassandra-thrift"
> > rev="0.8.1"/>
> >        <dependency org="com.ecyrd.speed4j" name="speed4j" rev="0.9"
> > conf="*->*,!javadoc,!sources"/>
> >        <dependency org="com.github.stephenc.high-scale-lib"
> > name="high-scale-lib" rev="1.1.2" conf="*->*,!javadoc,!sources"/>
> >        <dependency org="com.google.collections" name="google-collections"
> > rev="1.0" conf="*->*,!javadoc,!sources"/>
> >        <dependency org="com.google.guava" name="guava" rev="r09"
> > conf="*->*,!javadoc,!sources"/>
> >
> > $ ant clean
> > $ ant
> >
> > In your case libthrift should now be downloaded by Ivy and then bundled
> > into the nutch-2.0-dev.job file. I'm not sure how apache-cassandra and
> > hector got included in your classpath...
> >
> > Somehow we need to resolve as well:
> >        <dependency org="org.apache.cassandra" name="apache-cassandra"
> > rev="0.8.1"/>
> >        <dependency org="me.prettyprint" name="hector" rev="0.8.0-1"/>
> >
> > I don't think these 2 jars are in the default Maven repository, so
> > they won't be downloaded; that's why they were commented out in the Gora
> > Cassandra Ivy config (gora/trunk/gora-cassandra/ivy/ivy.xml)
> >
> >
> > Since hector jar is not found in my case I get:
> > ~/java/workspace/Nutch/trunk/runtime/deploy$ bin/nutch inject ~/java/workspace/Nutch/seeds
> > 11/08/01 14:18:42 INFO crawl.InjectorJob: InjectorJob: starting
> > 11/08/01 14:18:42 INFO crawl.InjectorJob: InjectorJob: urlDir:
> > /home/alex/java/workspace/Nutch/seeds
> > 11/08/01 14:18:42 INFO security.Groups: Group mapping impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping;
> > cacheTimeout=300000
> > 11/08/01 14:18:42 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
> > 11/08/01 14:18:42 ERROR crawl.InjectorJob: InjectorJob:
> > org.apache.gora.util.GoraException:
> > java.lang.reflect.InvocationTargetException
> >        at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:110)
> >        at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:93)
> >        at org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:59)
> >        at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:243)
> >        at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:268)
> >        at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:282)
> >        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
> >        at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:292)
> >        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >        at java.lang.reflect.Method.invoke(Method.java:597)
> >        at org.apache.hadoop.util.RunJar.main(RunJar.java:192)
> > Caused by: java.lang.reflect.InvocationTargetException
> >        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> >        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
> >        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
> >        at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
> >        at org.apache.gora.util.ReflectionUtils.newInstance(ReflectionUtils.java:76)
> >        at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:102)
> >        ... 12 more
> > Caused by: java.lang.NoClassDefFoundError: me/prettyprint/hector/api/Serializer
> >        at org.apache.gora.cassandra.store.CassandraStore.<init>(CassandraStore.java:60)
> >        ... 18 more
> > Caused by: java.lang.ClassNotFoundException:
> > me.prettyprint.hector.api.Serializer
> >        at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
> >        at java.security.AccessController.doPrivileged(Native Method)
> >        at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
> >        at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
> >        at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
> >        ... 19 more
> >
> >
> >
> >
>



-- 
*Lewis*

RE: Nutch 2 and Cassandra

Posted by Tom Davidson <td...@covario.com>.
I found the problem. I am using Cloudera CDH3 and it has a hue plugins jar with an older thrift library in it. I removed the jar from my classpath and all is good. Thanks for your help.
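
For anyone hitting the same NoSuchMethodError: the "(BZ)V" in the trace is the JVM descriptor of a constructor taking a byte and a boolean, so the FieldValueMetaData that was actually loaded did not have that constructor. A stale copy of a class can be located from Java itself. The following is only a minimal sketch, not part of Nutch, Gora or Hadoop; the class name ClassOrigins and its helper methods are made up for illustration. It asks the class loader for every location that provides a given class, using the standard ClassLoader.getResources call, and it assumes it is launched with the same classpath as the failing job.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.util.Collections;
import java.util.List;

// Hypothetical diagnostic helper; not part of Nutch, Gora, or Hadoop.
class ClassOrigins {

    // Turns "a.b.C" into the resource path "a/b/C.class" used by class loaders.
    static String resourcePath(String className) {
        return className.replace('.', '/') + ".class";
    }

    // Lists every location visible to the loader that provides the class.
    static List<URL> findAll(String className, ClassLoader loader) {
        try {
            return Collections.list(loader.getResources(resourcePath(className)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Default is the class named in the stack trace; pass another name as the first argument.
        String name = args.length > 0 ? args[0]
                : "org.apache.thrift.meta_data.FieldValueMetaData";
        List<URL> hits = findAll(name, ClassLoader.getSystemClassLoader());
        if (hits.isEmpty()) {
            System.out.println(name + " not found on the classpath");
        }
        for (URL hit : hits) {
            System.out.println(hit);
        }
        if (hits.size() > 1) {
            System.out.println("Multiple copies found; which one is loaded depends on class loader lookup order.");
        }
    }
}
```

More than one hit for org.apache.thrift.meta_data.FieldValueMetaData would point at a duplicate, presumably like the older thrift classes in the hue plugins jar described above.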

-----Original Message-----
From: Tom Davidson [mailto:tdavidson@covario.com] 
Sent: Monday, August 01, 2011 3:29 PM
To: dev@nutch.apache.org
Subject: RE: Nutch 2 and Cassandra

OK... Are you running with a clustered version of Hadoop? I think you have to have your HADOOP_HOME env variable set. Otherwise it runs in local mode. I have been able to run in local mode, but not in deployed mode.
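
One rough way to tell the two modes apart is from the job output itself: in the logs posted in this thread the job id is job_local_0001 and some lines come from mapred.LocalJobRunner, which is what Hadoop's local runner prints. The sketch below only scans log lines for those two markers. The class LocalModeCheck and its method are hypothetical, and the check is a heuristic based on these logs rather than on any Hadoop API.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; a heuristic over log text, not a Hadoop API.
class LocalModeCheck {

    // Returns true if any line carries a marker of Hadoop's local job runner.
    static boolean looksLocal(List<String> logLines) {
        for (String line : logLines) {
            if (line.contains("Running job: job_local")
                    || line.contains("mapred.LocalJobRunner")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Two lines copied from the inject logs in this thread.
        List<String> sample = Arrays.asList(
                "11/08/01 15:17:49 INFO mapred.JobClient: Running job: job_local_0001",
                "11/08/01 15:17:51 INFO mapred.LocalJobRunner:");
        System.out.println(looksLocal(sample)
                ? "local mode"
                : "no local-runner markers found");
    }
}
```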


-----Original Message-----
From: Alexis [mailto:alexis.detreglode@gmail.com] 
Sent: Monday, August 01, 2011 3:25 PM
To: dev@nutch.apache.org
Subject: Re: Nutch 2 and Cassandra

Ok this version of hector was properly resolved. Thanks!

These are the logs:
~/java/workspace/Nutch/trunk/runtime/deploy$ bin/nutch inject
~/java/workspace/Nutch/seeds
11/08/01 15:17:45 INFO crawl.InjectorJob: InjectorJob: starting
11/08/01 15:17:45 INFO crawl.InjectorJob: InjectorJob: urlDir:
/home/alex/java/workspace/Nutch/seeds
11/08/01 15:17:45 INFO jvm.JvmMetrics: Initializing JVM Metrics with
processName=JobTracker, sessionId=
11/08/01 15:17:46 INFO connection.CassandraHostRetryService: Downed
Host Retry service started with queue size -1 and retry delay 10s
11/08/01 15:17:46 INFO service.JmxMonitor: Registering JMX
me.prettyprint.cassandra.service_Test
Cluster:ServiceType=hector,MonitorType=hector
11/08/01 15:17:47 INFO store.CassandraClient: Keyspace 'webpage' in
cluster 'Test Cluster' was created on host 'localhost'
11/08/01 15:17:48 INFO input.FileInputFormat: Total input paths to process : 1
11/08/01 15:17:49 INFO mapred.JobClient: Running job: job_local_0001
11/08/01 15:17:49 INFO input.FileInputFormat: Total input paths to process : 1
11/08/01 15:17:49 INFO mapreduce.GoraRecordWriter:
gora.buffer.write.limit = 10000
11/08/01 15:17:49 INFO plugin.PluginRepository: Plugins: looking in:
/tmp/hadoop-alex/hadoop-unjar8045717865743865180/plugins
11/08/01 15:17:49 INFO plugin.PluginRepository: Plugin Auto-activation
mode: [true]
11/08/01 15:17:49 INFO plugin.PluginRepository: Registered Plugins:
11/08/01 15:17:49 INFO plugin.PluginRepository:         the nutch core
extension points (nutch-extensionpoints)
11/08/01 15:17:49 INFO plugin.PluginRepository:         Basic URL
Normalizer (urlnormalizer-basic)
11/08/01 15:17:49 INFO plugin.PluginRepository:         Basic Indexing
Filter (index-basic)
11/08/01 15:17:49 INFO plugin.PluginRepository:         Html Parse
Plug-in (parse-html)
11/08/01 15:17:49 INFO plugin.PluginRepository:         HTTP Framework
(lib-http)
11/08/01 15:17:49 INFO plugin.PluginRepository:         Pass-through
URL Normalizer (urlnormalizer-pass)
11/08/01 15:17:49 INFO plugin.PluginRepository:         Regex URL
Filter (urlfilter-regex)
11/08/01 15:17:49 INFO plugin.PluginRepository:         Http Protocol
Plug-in (protocol-http)
11/08/01 15:17:49 INFO plugin.PluginRepository:         Regex URL
Normalizer (urlnormalizer-regex)
11/08/01 15:17:49 INFO plugin.PluginRepository:         Tika Parser
Plug-in (parse-tika)
11/08/01 15:17:49 INFO plugin.PluginRepository:         OPIC Scoring
Plug-in (scoring-opic)
11/08/01 15:17:49 INFO plugin.PluginRepository:         CyberNeko HTML
Parser (lib-nekohtml)
11/08/01 15:17:49 INFO plugin.PluginRepository:         Anchor
Indexing Filter (index-anchor)
11/08/01 15:17:49 INFO plugin.PluginRepository:         Regex URL
Filter Framework (lib-regex-filter)
11/08/01 15:17:49 INFO plugin.PluginRepository: Registered Extension-Points:
11/08/01 15:17:49 INFO plugin.PluginRepository:         Nutch URL
Normalizer (org.apache.nutch.net.URLNormalizer)
11/08/01 15:17:49 INFO plugin.PluginRepository:         Nutch Protocol
(org.apache.nutch.protocol.Protocol)
11/08/01 15:17:49 INFO plugin.PluginRepository:         Parse Filter
(org.apache.nutch.parse.ParseFilter)
11/08/01 15:17:49 INFO plugin.PluginRepository:         Nutch URL
Filter (org.apache.nutch.net.URLFilter)
11/08/01 15:17:49 INFO plugin.PluginRepository:         Nutch Indexing
Filter (org.apache.nutch.indexer.IndexingFilter)
11/08/01 15:17:49 INFO plugin.PluginRepository:         Nutch Content
Parser (org.apache.nutch.parse.Parser)
11/08/01 15:17:49 INFO plugin.PluginRepository:         Nutch Scoring
(org.apache.nutch.scoring.ScoringFilter)
11/08/01 15:17:50 INFO conf.Configuration: found resource
regex-normalize.xml at
file:/tmp/hadoop-alex/hadoop-unjar8045717865743865180/regex-normalize.xml
11/08/01 15:17:50 INFO conf.Configuration: found resource
regex-urlfilter.txt at
file:/tmp/hadoop-alex/hadoop-unjar8045717865743865180/regex-urlfilter.txt
11/08/01 15:17:50 INFO regex.RegexURLNormalizer: can't find rules for
scope 'inject', using default
11/08/01 15:17:50 INFO mapred.JobClient:  map 0% reduce 0%
11/08/01 15:17:51 INFO mapred.TaskRunner:
Task:attempt_local_0001_m_000000_0 is done. And is in the process of
commiting
11/08/01 15:17:51 INFO mapred.LocalJobRunner:
11/08/01 15:17:51 INFO mapred.TaskRunner: Task
'attempt_local_0001_m_000000_0' done.
11/08/01 15:17:52 INFO mapred.JobClient:  map 100% reduce 0%
11/08/01 15:17:52 INFO mapred.JobClient: Job complete: job_local_0001
11/08/01 15:17:52 INFO mapred.JobClient: Counters: 5
11/08/01 15:17:52 INFO mapred.JobClient:   FileSystemCounters
11/08/01 15:17:52 INFO mapred.JobClient:     FILE_BYTES_READ=44872735
11/08/01 15:17:52 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=45245279
11/08/01 15:17:52 INFO mapred.JobClient:   Map-Reduce Framework
11/08/01 15:17:52 INFO mapred.JobClient:     Map input records=3
11/08/01 15:17:52 INFO mapred.JobClient:     Spilled Records=0
11/08/01 15:17:52 INFO mapred.JobClient:     Map output records=3
11/08/01 15:17:52 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics
with processName=JobTracker, sessionId= - already initialized
11/08/01 15:17:52 INFO crawl.InjectorJob: InjectorJob: finished



This is what was added to ivy/ivy.xml:

+       <dependency org="org.apache.gora" name="gora-cassandra"
rev="0.2-incubating" conf="*->compile"/>
+       <dependency org="org.apache.cassandra" name="cassandra-thrift"
rev="0.8.1"/>
+       <dependency org="com.ecyrd.speed4j" name="speed4j" rev="0.9"
conf="*->*,!javadoc,!sources"/>
+       <dependency org="com.github.stephenc.high-scale-lib"
name="high-scale-lib" rev="1.1.2" conf="*->*,!javadoc,!sources"/>
+       <dependency org="com.google.collections"
name="google-collections" rev="1.0" conf="*->*,!javadoc,!sources"/>
+       <dependency org="com.google.guava" name="guava" rev="r09"
conf="*->*,!javadoc,!sources"/>
+       <dependency org="org.apache.cassandra" name="apache-cassandra"
rev="0.8.1"/>
+       <dependency org="me.prettyprint" name="hector-core" rev="0.8.0-2"/>



On Mon, Aug 1, 2011 at 2:55 PM, Tom Davidson <td...@covario.com> wrote:
> I did something similar to below to add the Cassandra dependencies. Note that I am getting NoSuchMethodErrors not ClassNotFoundExceptions. Can you add the hector jars to your nutch job jar and see what you get? I think I am one step ahead of you. BTW, I just added this line to get the hector dependency:
>
>        <dependency org="me.prettyprint" name="hector-core" rev="0.8.0-2" conf="*->default"/>
>
> -----Original Message-----
> From: Alexis [mailto:alexis.detreglode@gmail.com]
> Sent: Monday, August 01, 2011 2:28 PM
> To: dev@nutch.apache.org
> Subject: Re: Nutch 2 and Cassandra
>
> Hi, libthrift is a dependency of cassandra-thrift, as listed here:
> http://mvnrepository.com/artifact/org.apache.cassandra/cassandra-thrift/0.8.1
>
> During Nutch build, you have to manually tweak the Ivy configuration depending on your choice of the Gora store, in this case Cassandra.
> Basically you need to add all the dependencies listed there:
> http://svn.apache.org/viewvc/incubator/gora/trunk/gora-cassandra/ivy/ivy.xml?view=markup
>
> Let's try to add to $NUTCH_HOME/ivy/ivy.xml the following dependencies and then let's rebuild Nutch (see attached patch):
>        <dependency org="org.apache.gora" name="gora-cassandra"
> rev="0.2-incubating" conf="*->compile"/>
>        <dependency org="org.apache.cassandra" name="cassandra-thrift" rev="0.8.1"/>
>        <dependency org="com.ecyrd.speed4j" name="speed4j" rev="0.9"
> conf="*->*,!javadoc,!sources"/>
>        <dependency org="com.github.stephenc.high-scale-lib"
> name="high-scale-lib" rev="1.1.2" conf="*->*,!javadoc,!sources"/>
>        <dependency org="com.google.collections" name="google-collections"
> rev="1.0" conf="*->*,!javadoc,!sources"/>
>        <dependency org="com.google.guava" name="guava" rev="r09"
> conf="*->*,!javadoc,!sources"/>
>
> $ ant clean
> $ ant
>
> In your case libthrift should now be downloaded by Ivy and then bundled into the nutch-2.0-dev.job file. I'm not sure how apache-cassandra and hector got included in your classpath...
>
> Somehow we need to resolve as well:
>        <dependency org="org.apache.cassandra" name="apache-cassandra"
> rev="0.8.1"/>
>        <dependency org="me.prettyprint" name="hector" rev="0.8.0-1"/>
>
> I don't think these 2 jars are in the default Maven repository, so they won't be downloaded; that's why they were commented out in the Gora Cassandra Ivy config (gora/trunk/gora-cassandra/ivy/ivy.xml).
>
>
> Since hector jar is not found in my case I get:
> ~/java/workspace/Nutch/trunk/runtime/deploy$ bin/nutch inject ~/java/workspace/Nutch/seeds
> 11/08/01 14:18:42 INFO crawl.InjectorJob: InjectorJob: starting
> 11/08/01 14:18:42 INFO crawl.InjectorJob: InjectorJob: urlDir:
> /home/alex/java/workspace/Nutch/seeds
> 11/08/01 14:18:42 INFO security.Groups: Group mapping impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping;
> cacheTimeout=300000
> 11/08/01 14:18:42 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
> 11/08/01 14:18:42 ERROR crawl.InjectorJob: InjectorJob:
> org.apache.gora.util.GoraException:
> java.lang.reflect.InvocationTargetException
>        at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:110)
>        at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:93)
>        at org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:59)
>        at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:243)
>        at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:268)
>        at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:282)
>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
>        at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:292)
>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:597)
>        at org.apache.hadoop.util.RunJar.main(RunJar.java:192)
> Caused by: java.lang.reflect.InvocationTargetException
>        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
>        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
>        at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>        at org.apache.gora.util.ReflectionUtils.newInstance(ReflectionUtils.java:76)
>        at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:102)
>        ... 12 more
> Caused by: java.lang.NoClassDefFoundError: me/prettyprint/hector/api/Serializer
>        at org.apache.gora.cassandra.store.CassandraStore.<init>(CassandraStore.java:60)
>        ... 18 more
> Caused by: java.lang.ClassNotFoundException:
> me.prettyprint.hector.api.Serializer
>        at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
>        at java.security.AccessController.doPrivileged(Native Method)
>        at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
>        at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
>        at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
>        ... 19 more
>
>
>
>
> On Mon, Aug 1, 2011 at 11:59 AM, Tom Davidson <td...@covario.com> wrote:
>> Hi All,
>>
>>
>>
>> I am kind of at my wit's end here, so I am hoping someone here can
>> help.  I am trying to use Nutch2 and Cassandra and I have been
>> successful using the runtime/local build. I am using the Cloudera CDH3
>> on CentOS 5 and I do not want to contaminate my Hadoop install by
>> dropping in a bunch of Nutch jars, etc. So I am trying to use the
>> nutch-2-dev.job jar. When I try to use the nutch2-dev.job jar, I get
>> the error below.  I have double and triple checked the classpath and
>> the included jars and the only jar that contains FieldValueMetaData is
>> the libthrift-0.6.1.jar which has the method that is claimed to be missing. Any ideas?
>>
>>
>>
>> Thanks,
>>
>> Tom
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> [tdavidson@nadevsan06 ~]$ bin/nutch inject urls
>>
>> /opt/jdk1.6.0_21/bin/java -Dproc_jar -Xmx1000m -Dhadoop.log.dir=/usr/lib/hadoop-0.20/logs -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/usr/lib/hadoop-0.20 -Dhadoop.id.str=tdavidson -Dhadoop.root.logger=INFO,console -Djava.library.path=/usr/lib/hadoop-0.20/lib/native/Linux-amd64-64 -Dhadoop.policy.file=hadoop-policy.xml -classpath /usr/lib/hadoop-0.20/conf:/opt/jdk1.6.0_21/lib/tools.jar:/usr/lib/hadoop-0.20:/usr/lib/hadoop-0.20/hadoop-core-0.20.2-cdh3u1.jar:/usr/lib/hadoop-0.20/lib/ant-contrib-1.0b3.jar:/usr/lib/hadoop-0.20/lib/aspectjrt-1.6.5.jar:/usr/lib/hadoop-0.20/lib/aspectjtools-1.6.5.jar:/usr/lib/hadoop-0.20/lib/commons-cli-1.2.jar:/usr/lib/hadoop-0.20/lib/commons-codec-1.4.jar:/usr/lib/hadoop-0.20/lib/commons-daemon-1.0.1.jar:/usr/lib/hadoop-0.20/lib/commons-el-1.0.jar:/usr/lib/hadoop-0.20/lib/commons-httpclient-3.0.1.jar:/usr/lib/hadoop-0.20/lib/commons-logging-1.0.4.jar:/usr/lib/hadoop-0.20/lib/commons-logging-api-1.0.4.jar:/usr/lib/hadoop-0.20/lib/commons-net-1.4.1.jar:/usr/lib/hadoop-0.20/lib/core-3.1.1.jar:/usr/lib/hadoop-0.20/lib/hadoop-fairscheduler-0.20.2-cdh3u1.jar:/usr/lib/hadoop-0.20/lib/hsqldb-1.8.0.10.jar:/usr/lib/hadoop-0.20/lib/hue-plugins-1.2.0-cdh3u1.jar:/usr/lib/hadoop-0.20/lib/jackson-core-asl-1.5.2.jar:/usr/lib/hadoop-0.20/lib/jackson-mapper-asl-1.5.2.jar:/usr/lib/hadoop-0.20/lib/jasper-compiler-5.5.12.jar:/usr/lib/hadoop-0.20/lib/jasper-runtime-5.5.12.jar:/usr/lib/hadoop-0.20/lib/jets3t-0.6.1.jar:/usr/lib/hadoop-0.20/lib/jetty-6.1.26.jar:/usr/lib/hadoop-0.20/lib/jetty-servlet-tester-6.1.26.jar:/usr/lib/hadoop-0.20/lib/jetty-util-6.1.26.jar:/usr/lib/hadoop-0.20/lib/jsch-0.1.42.jar:/usr/lib/hadoop-0.20/lib/junit-4.5.jar:/usr/lib/hadoop-0.20/lib/kfs-0.2.2.jar:/usr/lib/hadoop-0.20/lib/log4j-1.2.15.jar:/usr/lib/hadoop-0.20/lib/mockito-all-1.8.2.jar:/usr/lib/hadoop-0.20/lib/oro-2.0.8.jar:/usr/lib/hadoop-0.20/lib/servlet-api-2.5-20081211.jar:/usr/lib/hadoop-0.20/lib/servlet-api-2.5-6.1.14.jar:/usr/lib/hadoop-0.20/lib/slf4j-api-1.4.3.jar:/usr/lib/hadoop-0.20/lib/slf4j-log4j12-1.4.3.jar:/usr/lib/hadoop-0.20/lib/xmlenc-0.52.jar:/usr/lib/hadoop-0.20/lib/jsp-2.1/jsp-2.1.jar:/usr/lib/hadoop-0.20/lib/jsp-2.1/jsp-api-2.1.jar org.apache.hadoop.util.RunJar /home/SEMDIRECTOR/tdavidson/nutch-2.job org.apache.nutch.crawl.InjectorJob urls
>>
>> 11/08/01 11:51:54 INFO crawl.InjectorJob: InjectorJob: starting
>> 11/08/01 11:51:54 INFO crawl.InjectorJob: InjectorJob: urlDir: urls
>> 11/08/01 11:51:55 INFO connection.CassandraHostRetryService: Downed Host Retry service started with queue size -1 and retry delay 10s
>> 11/08/01 11:51:55 INFO service.JmxMonitor: Registering JMX me.prettyprint.cassandra.service_Test Cluster:ServiceType=hector,MonitorType=hector
>> 11/08/01 11:51:55 ERROR crawl.InjectorJob: InjectorJob: org.apache.gora.util.GoraException: java.lang.reflect.InvocationTargetException
>>         at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:110)
>>         at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:93)
>>         at org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:59)
>>         at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:243)
>>         at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:268)
>>         at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:282)
>>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>         at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:292)
>>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>         at java.lang.reflect.Method.invoke(Method.java:597)
>>         at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
>> Caused by: java.lang.reflect.InvocationTargetException
>>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>>         at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
>>         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
>>         at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>>         at org.apache.gora.util.ReflectionUtils.newInstance(ReflectionUtils.java:76)
>>         at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:102)
>>         ... 12 more
>> Caused by: java.lang.NoSuchMethodError: org.apache.thrift.meta_data.FieldValueMetaData.<init>(BZ)V
>>         at org.apache.cassandra.thrift.CfDef.<clinit>(CfDef.java:299)
>>         at org.apache.cassandra.thrift.KsDef.read(KsDef.java:753)
>>         at org.apache.cassandra.thrift.Cassandra$describe_keyspace_result.read(Cassandra.java:24338)
>>         at org.apache.cassandra.thrift.Cassandra$Client.recv_describe_keyspace(Cassandra.java:1371)
>>         at org.apache.cassandra.thrift.Cassandra$Client.describe_keyspace(Cassandra.java:1346)
>>         at me.prettyprint.cassandra.service.AbstractCluster$4.execute(AbstractCluster.java:192)
>>         at me.prettyprint.cassandra.service.AbstractCluster$4.execute(AbstractCluster.java:187)
>>         at me.prettyprint.cassandra.service.Operation.executeAndSetResult(Operation.java:101)
>>         at me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:232)
>>         at me.prettyprint.cassandra.service.AbstractCluster.describeKeyspace(AbstractCluster.java:201)
>>         at org.apache.gora.cassandra.store.CassandraClient.checkKeyspace(CassandraClient.java:82)
>>         at org.apache.gora.cassandra.store.CassandraClient.init(CassandraClient.java:69)
>>         at org.apache.gora.cassandra.store.CassandraStore.<init>(CassandraStore.java:68)
>>         ... 18 more
>

RE: Nutch 2 and Cassandra

Posted by Tom Davidson <td...@covario.com>.
OK... Are you running with a clustered version of Hadoop? I think you have to have your HADOOP_HOME env variable set. Otherwise it runs in local mode. I have been able to run in local mode, but not in deployed mode.


-----Original Message-----
From: Alexis [mailto:alexis.detreglode@gmail.com] 
Sent: Monday, August 01, 2011 3:25 PM
To: dev@nutch.apache.org
Subject: Re: Nutch 2 and Cassandra

Ok this version of hector was properly resolved. Thanks!

These are the logs:
~/java/workspace/Nutch/trunk/runtime/deploy$ bin/nutch inject
~/java/workspace/Nutch/seeds
11/08/01 15:17:45 INFO crawl.InjectorJob: InjectorJob: starting
11/08/01 15:17:45 INFO crawl.InjectorJob: InjectorJob: urlDir:
/home/alex/java/workspace/Nutch/seeds
11/08/01 15:17:45 INFO jvm.JvmMetrics: Initializing JVM Metrics with
processName=JobTracker, sessionId=
11/08/01 15:17:46 INFO connection.CassandraHostRetryService: Downed
Host Retry service started with queue size -1 and retry delay 10s
11/08/01 15:17:46 INFO service.JmxMonitor: Registering JMX
me.prettyprint.cassandra.service_Test
Cluster:ServiceType=hector,MonitorType=hector
11/08/01 15:17:47 INFO store.CassandraClient: Keyspace 'webpage' in
cluster 'Test Cluster' was created on host 'localhost'
11/08/01 15:17:48 INFO input.FileInputFormat: Total input paths to process : 1
11/08/01 15:17:49 INFO mapred.JobClient: Running job: job_local_0001
11/08/01 15:17:49 INFO input.FileInputFormat: Total input paths to process : 1
11/08/01 15:17:49 INFO mapreduce.GoraRecordWriter:
gora.buffer.write.limit = 10000
11/08/01 15:17:49 INFO plugin.PluginRepository: Plugins: looking in:
/tmp/hadoop-alex/hadoop-unjar8045717865743865180/plugins
11/08/01 15:17:49 INFO plugin.PluginRepository: Plugin Auto-activation
mode: [true]
11/08/01 15:17:49 INFO plugin.PluginRepository: Registered Plugins:
11/08/01 15:17:49 INFO plugin.PluginRepository:         the nutch core
extension points (nutch-extensionpoints)
11/08/01 15:17:49 INFO plugin.PluginRepository:         Basic URL
Normalizer (urlnormalizer-basic)
11/08/01 15:17:49 INFO plugin.PluginRepository:         Basic Indexing
Filter (index-basic)
11/08/01 15:17:49 INFO plugin.PluginRepository:         Html Parse
Plug-in (parse-html)
11/08/01 15:17:49 INFO plugin.PluginRepository:         HTTP Framework
(lib-http)
11/08/01 15:17:49 INFO plugin.PluginRepository:         Pass-through
URL Normalizer (urlnormalizer-pass)
11/08/01 15:17:49 INFO plugin.PluginRepository:         Regex URL
Filter (urlfilter-regex)
11/08/01 15:17:49 INFO plugin.PluginRepository:         Http Protocol
Plug-in (protocol-http)
11/08/01 15:17:49 INFO plugin.PluginRepository:         Regex URL
Normalizer (urlnormalizer-regex)
11/08/01 15:17:49 INFO plugin.PluginRepository:         Tika Parser
Plug-in (parse-tika)
11/08/01 15:17:49 INFO plugin.PluginRepository:         OPIC Scoring
Plug-in (scoring-opic)
11/08/01 15:17:49 INFO plugin.PluginRepository:         CyberNeko HTML
Parser (lib-nekohtml)
11/08/01 15:17:49 INFO plugin.PluginRepository:         Anchor
Indexing Filter (index-anchor)
11/08/01 15:17:49 INFO plugin.PluginRepository:         Regex URL
Filter Framework (lib-regex-filter)
11/08/01 15:17:49 INFO plugin.PluginRepository: Registered Extension-Points:
11/08/01 15:17:49 INFO plugin.PluginRepository:         Nutch URL
Normalizer (org.apache.nutch.net.URLNormalizer)
11/08/01 15:17:49 INFO plugin.PluginRepository:         Nutch Protocol
(org.apache.nutch.protocol.Protocol)
11/08/01 15:17:49 INFO plugin.PluginRepository:         Parse Filter
(org.apache.nutch.parse.ParseFilter)
11/08/01 15:17:49 INFO plugin.PluginRepository:         Nutch URL
Filter (org.apache.nutch.net.URLFilter)
11/08/01 15:17:49 INFO plugin.PluginRepository:         Nutch Indexing
Filter (org.apache.nutch.indexer.IndexingFilter)
11/08/01 15:17:49 INFO plugin.PluginRepository:         Nutch Content
Parser (org.apache.nutch.parse.Parser)
11/08/01 15:17:49 INFO plugin.PluginRepository:         Nutch Scoring
(org.apache.nutch.scoring.ScoringFilter)
11/08/01 15:17:50 INFO conf.Configuration: found resource
regex-normalize.xml at
file:/tmp/hadoop-alex/hadoop-unjar8045717865743865180/regex-normalize.xml
11/08/01 15:17:50 INFO conf.Configuration: found resource
regex-urlfilter.txt at
file:/tmp/hadoop-alex/hadoop-unjar8045717865743865180/regex-urlfilter.txt
11/08/01 15:17:50 INFO regex.RegexURLNormalizer: can't find rules for
scope 'inject', using default
11/08/01 15:17:50 INFO mapred.JobClient:  map 0% reduce 0%
11/08/01 15:17:51 INFO mapred.TaskRunner:
Task:attempt_local_0001_m_000000_0 is done. And is in the process of
commiting
11/08/01 15:17:51 INFO mapred.LocalJobRunner:
11/08/01 15:17:51 INFO mapred.TaskRunner: Task
'attempt_local_0001_m_000000_0' done.
11/08/01 15:17:52 INFO mapred.JobClient:  map 100% reduce 0%
11/08/01 15:17:52 INFO mapred.JobClient: Job complete: job_local_0001
11/08/01 15:17:52 INFO mapred.JobClient: Counters: 5
11/08/01 15:17:52 INFO mapred.JobClient:   FileSystemCounters
11/08/01 15:17:52 INFO mapred.JobClient:     FILE_BYTES_READ=44872735
11/08/01 15:17:52 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=45245279
11/08/01 15:17:52 INFO mapred.JobClient:   Map-Reduce Framework
11/08/01 15:17:52 INFO mapred.JobClient:     Map input records=3
11/08/01 15:17:52 INFO mapred.JobClient:     Spilled Records=0
11/08/01 15:17:52 INFO mapred.JobClient:     Map output records=3
11/08/01 15:17:52 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics
with processName=JobTracker, sessionId= - already initialized
11/08/01 15:17:52 INFO crawl.InjectorJob: InjectorJob: finished



This is what was added to ivy/ivy.xml:

+       <dependency org="org.apache.gora" name="gora-cassandra"
rev="0.2-incubating" conf="*->compile"/>
+       <dependency org="org.apache.cassandra" name="cassandra-thrift"
rev="0.8.1"/>
+       <dependency org="com.ecyrd.speed4j" name="speed4j" rev="0.9"
conf="*->*,!javadoc,!sources"/>
+       <dependency org="com.github.stephenc.high-scale-lib"
name="high-scale-lib" rev="1.1.2" conf="*->*,!javadoc,!sources"/>
+       <dependency org="com.google.collections"
name="google-collections" rev="1.0" conf="*->*,!javadoc,!sources"/>
+       <dependency org="com.google.guava" name="guava" rev="r09"
conf="*->*,!javadoc,!sources"/>
+       <dependency org="org.apache.cassandra" name="apache-cassandra"
rev="0.8.1"/>
+       <dependency org="me.prettyprint" name="hector-core" rev="0.8.0-2"/>



On Mon, Aug 1, 2011 at 2:55 PM, Tom Davidson <td...@covario.com> wrote:
> I did something similar to below to add the Cassandra dependencies. Note that I am getting NoSuchMethodErrors not ClassNotFoundExceptions. Can you add the hector jars to your nutch job jar and see what you get? I think I am one step ahead of you. BTW, I just added this line to get the hector dependency:
>
>        <dependency org="me.prettyprint" name="hector-core" rev="0.8.0-2" conf="*->default"/>
>
> -----Original Message-----
> From: Alexis [mailto:alexis.detreglode@gmail.com]
> Sent: Monday, August 01, 2011 2:28 PM
> To: dev@nutch.apache.org
> Subject: Re: Nutch 2 and Cassandra
>
> Hi, libthrift is a dependency of cassandra-thrift, as listed here:
> http://mvnrepository.com/artifact/org.apache.cassandra/cassandra-thrift/0.8.1
>
> During Nutch build, you have to manually tweak the Ivy configuration depending on your choice of the Gora store, in this case Cassandra.
> Basically you need to add all the dependencies listed there:
> http://svn.apache.org/viewvc/incubator/gora/trunk/gora-cassandra/ivy/ivy.xml?view=markup
>
> Let's try to add to $NUTCH_HOME/ivy/ivy.xml the following dependencies and then let's rebuild Nutch (see attached patch):
>        <dependency org="org.apache.gora" name="gora-cassandra"
> rev="0.2-incubating" conf="*->compile"/>
>        <dependency org="org.apache.cassandra" name="cassandra-thrift" rev="0.8.1"/>
>        <dependency org="com.ecyrd.speed4j" name="speed4j" rev="0.9"
> conf="*->*,!javadoc,!sources"/>
>        <dependency org="com.github.stephenc.high-scale-lib"
> name="high-scale-lib" rev="1.1.2" conf="*->*,!javadoc,!sources"/>
>        <dependency org="com.google.collections" name="google-collections"
> rev="1.0" conf="*->*,!javadoc,!sources"/>
>        <dependency org="com.google.guava" name="guava" rev="r09"
> conf="*->*,!javadoc,!sources"/>
>
> $ ant clean
> $ ant
>
> In your case libthrift should now be downloaded by Ivy and then bundled into the nutch-2.0-dev.job file. I'm not sure how apache-cassandra and hector got included in your classpath...
>
> Somehow we need to resolve as well:
>        <dependency org="org.apache.cassandra" name="apache-cassandra"
> rev="0.8.1"/>
>        <dependency org="me.prettyprint" name="hector" rev="0.8.0-1"/>
>
> I don't think these 2 jars are in the default Maven repository, so they won't be downloaded; that's why they were commented out in the Gora Cassandra Ivy config (gora/trunk/gora-cassandra/ivy/ivy.xml).
>
>
> Since hector jar is not found in my case I get:
> ~/java/workspace/Nutch/trunk/runtime/deploy$ bin/nutch inject ~/java/workspace/Nutch/seeds
> 11/08/01 14:18:42 INFO crawl.InjectorJob: InjectorJob: starting
> 11/08/01 14:18:42 INFO crawl.InjectorJob: InjectorJob: urlDir:
> /home/alex/java/workspace/Nutch/seeds
> 11/08/01 14:18:42 INFO security.Groups: Group mapping impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping;
> cacheTimeout=300000
> 11/08/01 14:18:42 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
> 11/08/01 14:18:42 ERROR crawl.InjectorJob: InjectorJob:
> org.apache.gora.util.GoraException:
> java.lang.reflect.InvocationTargetException
>        at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:110)
>        at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:93)
>        at org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:59)
>        at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:243)
>        at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:268)
>        at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:282)
>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
>        at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:292)
>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:597)
>        at org.apache.hadoop.util.RunJar.main(RunJar.java:192)
> Caused by: java.lang.reflect.InvocationTargetException
>        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
>        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
>        at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>        at org.apache.gora.util.ReflectionUtils.newInstance(ReflectionUtils.java:76)
>        at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:102)
>        ... 12 more
> Caused by: java.lang.NoClassDefFoundError: me/prettyprint/hector/api/Serializer
>        at org.apache.gora.cassandra.store.CassandraStore.<init>(CassandraStore.java:60)
>        ... 18 more
> Caused by: java.lang.ClassNotFoundException:
> me.prettyprint.hector.api.Serializer
>        at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
>        at java.security.AccessController.doPrivileged(Native Method)
>        at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
>        at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
>        at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
>        ... 19 more
>
>
>
>
> On Mon, Aug 1, 2011 at 11:59 AM, Tom Davidson <td...@covario.com> wrote:
>> Hi All,
>>
>>
>>
>> I am kind of at my wit's end here, so I am hoping someone here can
>> help.  I am trying to use Nutch2 and Cassandra and I have been
>> successful using the runtime/local build. I am using the Cloudera CDH3
>> on CentOS 5 and I do not want to contaminate my Hadoop install by
>> dropping in a bunch of Nutch jars, etc. So I am trying to use the
>> nutch-2-dev.job jar. When I try to use the nutch2-dev.job jar, I get
>> the error below.  I have double and triple checked the classpath and
>> the included jars and the only jar that contains FieldValueMetaData is
>> the libthrift-0.6.1.jar which has the method that is claimed to be missing. Any ideas?
>>
>>
>>
>> Thanks,
>>
>> Tom
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> [tdavidson@nadevsan06 ~]$ bin/nutch inject urls
>>
>> /opt/jdk1.6.0_21/bin/java -Dproc_jar -Xmx1000m -Dhadoop.log.dir=/usr/lib/hadoop-0.20/logs -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/usr/lib/hadoop-0.20 -Dhadoop.id.str=tdavidson -Dhadoop.root.logger=INFO,console -Djava.library.path=/usr/lib/hadoop-0.20/lib/native/Linux-amd64-64 -Dhadoop.policy.file=hadoop-policy.xml -classpath /usr/lib/hadoop-0.20/conf:/opt/jdk1.6.0_21/lib/tools.jar:/usr/lib/hadoop-0.20:/usr/lib/hadoop-0.20/hadoop-core-0.20.2-cdh3u1.jar:/usr/lib/hadoop-0.20/lib/ant-contrib-1.0b3.jar:/usr/lib/hadoop-0.20/lib/aspectjrt-1.6.5.jar:/usr/lib/hadoop-0.20/lib/aspectjtools-1.6.5.jar:/usr/lib/hadoop-0.20/lib/commons-cli-1.2.jar:/usr/lib/hadoop-0.20/lib/commons-codec-1.4.jar:/usr/lib/hadoop-0.20/lib/commons-daemon-1.0.1.jar:/usr/lib/hadoop-0.20/lib/commons-el-1.0.jar:/usr/lib/hadoop-0.20/lib/commons-httpclient-3.0.1.jar:/usr/lib/hadoop-0.20/lib/commons-logging-1.0.4.jar:/usr/lib/hadoop-0.20/lib/commons-logging-api-1.0.4.jar:/usr/lib/hadoop-0.20/lib/commons-net-1.4.1.jar:/usr/lib/hadoop-0.20/lib/core-3.1.1.jar:/usr/lib/hadoop-0.20/lib/hadoop-fairscheduler-0.20.2-cdh3u1.jar:/usr/lib/hadoop-0.20/lib/hsqldb-1.8.0.10.jar:/usr/lib/hadoop-0.20/lib/hue-plugins-1.2.0-cdh3u1.jar:/usr/lib/hadoop-0.20/lib/jackson-core-asl-1.5.2.jar:/usr/lib/hadoop-0.20/lib/jackson-mapper-asl-1.5.2.jar:/usr/lib/hadoop-0.20/lib/jasper-compiler-5.5.12.jar:/usr/lib/hadoop-0.20/lib/jasper-runtime-5.5.12.jar:/usr/lib/hadoop-0.20/lib/jets3t-0.6.1.jar:/usr/lib/hadoop-0.20/lib/jetty-6.1.26.jar:/usr/lib/hadoop-0.20/lib/jetty-servlet-tester-6.1.26.jar:/usr/lib/hadoop-0.20/lib/jetty-util-6.1.26.jar:/usr/lib/hadoop-0.20/lib/jsch-0.1.42.jar:/usr/lib/hadoop-0.20/lib/junit-4.5.jar:/usr/lib/hadoop-0.20/lib/kfs-0.2.2.jar:/usr/lib/hadoop-0.20/lib/log4j-1.2.15.jar:/usr/lib/hadoop-0.20/lib/mockito-all-1.8.2.jar:/usr/lib/hadoop-0.20/lib/oro-2.0.8.jar:/usr/lib/hadoop-0.20/lib/servlet-api-2.5-20081211.jar:/usr/lib/hadoop-0.20/lib/servlet-api-2.5-6.1.14.jar:/usr/lib/hadoop-0.20/lib/slf4j-api-1.4.3.jar:/usr/lib/hadoop-0.20/lib/slf4j-log4j12-1.4.3.jar:/usr/lib/hadoop-0.20/lib/xmlenc-0.52.jar:/usr/lib/hadoop-0.20/lib/jsp-2.1/jsp-2.1.jar:/usr/lib/hadoop-0.20/lib/jsp-2.1/jsp-api-2.1.jar org.apache.hadoop.util.RunJar /home/SEMDIRECTOR/tdavidson/nutch-2.job org.apache.nutch.crawl.InjectorJob urls
>>
>> 11/08/01 11:51:54 INFO crawl.InjectorJob: InjectorJob: starting
>>
>> 11/08/01 11:51:54 INFO crawl.InjectorJob: InjectorJob: urlDir: urls
>>
>> 11/08/01 11:51:55 INFO connection.CassandraHostRetryService: Downed
>> Host Retry service started with queue size -1 and retry delay 10s
>>
>> 11/08/01 11:51:55 INFO service.JmxMonitor: Registering JMX
>> me.prettyprint.cassandra.service_Test
>> Cluster:ServiceType=hector,MonitorType=hector
>>
>> 11/08/01 11:51:55 ERROR crawl.InjectorJob: InjectorJob:
>> org.apache.gora.util.GoraException:
>> java.lang.reflect.InvocationTargetException
>>
>>         at
>> org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactor
>> y.java:110)
>>
>>         at
>> org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactor
>> y.java:93)
>>
>>         at
>> org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java
>> :59)
>>
>>         at
>> org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:243)
>>
>>         at
>> org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:268)
>>
>>         at
>> org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:282)
>>
>>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>
>>         at
>> org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:292)
>>
>>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>
>>         at
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.j
>> ava:39)
>>
>>         at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccess
>> orImpl.java:25)
>>
>>         at java.lang.reflect.Method.invoke(Method.java:597)
>>
>>         at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
>>
>> Caused by: java.lang.reflect.InvocationTargetException
>>
>>         at
>> sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
>> Method)
>>
>>         at
>> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructo
>> rAccessorImpl.java:39)
>>
>>         at
>> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingCo
>> nstructorAccessorImpl.java:27)
>>
>>         at
>> java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>>
>>         at
>> org.apache.gora.util.ReflectionUtils.newInstance(ReflectionUtils.java:
>> 76)
>>
>>         at
>> org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactor
>> y.java:102)
>>
>>         ... 12 more
>>
>> Caused by: java.lang.NoSuchMethodError:
>> org.apache.thrift.meta_data.FieldValueMetaData.<init>(BZ)V
>>
>>         at org.apache.cassandra.thrift.CfDef.<clinit>(CfDef.java:299)
>>
>>         at org.apache.cassandra.thrift.KsDef.read(KsDef.java:753)
>>
>>         at
>> org.apache.cassandra.thrift.Cassandra$describe_keyspace_result.read(Ca
>> ssandra.java:24338)
>>
>>         at
>> org.apache.cassandra.thrift.Cassandra$Client.recv_describe_keyspace(Ca
>> ssandra.java:1371)
>>
>>         at
>> org.apache.cassandra.thrift.Cassandra$Client.describe_keyspace(Cassand
>> ra.java:1346)
>>
>>         at
>> me.prettyprint.cassandra.service.AbstractCluster$4.execute(AbstractClu
>> ster.java:192)
>>
>>         at
>> me.prettyprint.cassandra.service.AbstractCluster$4.execute(AbstractClu
>> ster.java:187)
>>
>>         at
>> me.prettyprint.cassandra.service.Operation.executeAndSetResult(Operati
>> on.java:101)
>>
>>         at
>> me.prettyprint.cassandra.connection.HConnectionManager.operateWithFail
>> over(HConnectionManager.java:232)
>>
>>         at
>> me.prettyprint.cassandra.service.AbstractCluster.describeKeyspace(Abst
>> ractCluster.java:201)
>>
>>         at
>> org.apache.gora.cassandra.store.CassandraClient.checkKeyspace(Cassandr
>> aClient.java:82)
>>
>>         at
>> org.apache.gora.cassandra.store.CassandraClient.init(CassandraClient.j
>> ava:69)
>>
>>         at
>> org.apache.gora.cassandra.store.CassandraStore.<init>(CassandraStore.j
>> ava:68)
>>
>>         ... 18 more
>

Re: Nutch 2 and Cassandra

Posted by Alexis <al...@gmail.com>.
Ok this version of hector was properly resolved. Thanks!

These are the logs:
~/java/workspace/Nutch/trunk/runtime/deploy$ bin/nutch inject ~/java/workspace/Nutch/seeds
11/08/01 15:17:45 INFO crawl.InjectorJob: InjectorJob: starting
11/08/01 15:17:45 INFO crawl.InjectorJob: InjectorJob: urlDir: /home/alex/java/workspace/Nutch/seeds
11/08/01 15:17:45 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
11/08/01 15:17:46 INFO connection.CassandraHostRetryService: Downed Host Retry service started with queue size -1 and retry delay 10s
11/08/01 15:17:46 INFO service.JmxMonitor: Registering JMX me.prettyprint.cassandra.service_Test Cluster:ServiceType=hector,MonitorType=hector
11/08/01 15:17:47 INFO store.CassandraClient: Keyspace 'webpage' in cluster 'Test Cluster' was created on host 'localhost'
11/08/01 15:17:48 INFO input.FileInputFormat: Total input paths to process : 1
11/08/01 15:17:49 INFO mapred.JobClient: Running job: job_local_0001
11/08/01 15:17:49 INFO input.FileInputFormat: Total input paths to process : 1
11/08/01 15:17:49 INFO mapreduce.GoraRecordWriter: gora.buffer.write.limit = 10000
11/08/01 15:17:49 INFO plugin.PluginRepository: Plugins: looking in: /tmp/hadoop-alex/hadoop-unjar8045717865743865180/plugins
11/08/01 15:17:49 INFO plugin.PluginRepository: Plugin Auto-activation mode: [true]
11/08/01 15:17:49 INFO plugin.PluginRepository: Registered Plugins:
11/08/01 15:17:49 INFO plugin.PluginRepository:         the nutch core extension points (nutch-extensionpoints)
11/08/01 15:17:49 INFO plugin.PluginRepository:         Basic URL Normalizer (urlnormalizer-basic)
11/08/01 15:17:49 INFO plugin.PluginRepository:         Basic Indexing Filter (index-basic)
11/08/01 15:17:49 INFO plugin.PluginRepository:         Html Parse Plug-in (parse-html)
11/08/01 15:17:49 INFO plugin.PluginRepository:         HTTP Framework (lib-http)
11/08/01 15:17:49 INFO plugin.PluginRepository:         Pass-through URL Normalizer (urlnormalizer-pass)
11/08/01 15:17:49 INFO plugin.PluginRepository:         Regex URL Filter (urlfilter-regex)
11/08/01 15:17:49 INFO plugin.PluginRepository:         Http Protocol Plug-in (protocol-http)
11/08/01 15:17:49 INFO plugin.PluginRepository:         Regex URL Normalizer (urlnormalizer-regex)
11/08/01 15:17:49 INFO plugin.PluginRepository:         Tika Parser Plug-in (parse-tika)
11/08/01 15:17:49 INFO plugin.PluginRepository:         OPIC Scoring Plug-in (scoring-opic)
11/08/01 15:17:49 INFO plugin.PluginRepository:         CyberNeko HTML Parser (lib-nekohtml)
11/08/01 15:17:49 INFO plugin.PluginRepository:         Anchor Indexing Filter (index-anchor)
11/08/01 15:17:49 INFO plugin.PluginRepository:         Regex URL Filter Framework (lib-regex-filter)
11/08/01 15:17:49 INFO plugin.PluginRepository: Registered Extension-Points:
11/08/01 15:17:49 INFO plugin.PluginRepository:         Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
11/08/01 15:17:49 INFO plugin.PluginRepository:         Nutch Protocol (org.apache.nutch.protocol.Protocol)
11/08/01 15:17:49 INFO plugin.PluginRepository:         Parse Filter (org.apache.nutch.parse.ParseFilter)
11/08/01 15:17:49 INFO plugin.PluginRepository:         Nutch URL Filter (org.apache.nutch.net.URLFilter)
11/08/01 15:17:49 INFO plugin.PluginRepository:         Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
11/08/01 15:17:49 INFO plugin.PluginRepository:         Nutch Content Parser (org.apache.nutch.parse.Parser)
11/08/01 15:17:49 INFO plugin.PluginRepository:         Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
11/08/01 15:17:50 INFO conf.Configuration: found resource regex-normalize.xml at file:/tmp/hadoop-alex/hadoop-unjar8045717865743865180/regex-normalize.xml
11/08/01 15:17:50 INFO conf.Configuration: found resource regex-urlfilter.txt at file:/tmp/hadoop-alex/hadoop-unjar8045717865743865180/regex-urlfilter.txt
11/08/01 15:17:50 INFO regex.RegexURLNormalizer: can't find rules for scope 'inject', using default
11/08/01 15:17:50 INFO mapred.JobClient:  map 0% reduce 0%
11/08/01 15:17:51 INFO mapred.TaskRunner: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
11/08/01 15:17:51 INFO mapred.LocalJobRunner:
11/08/01 15:17:51 INFO mapred.TaskRunner: Task 'attempt_local_0001_m_000000_0' done.
11/08/01 15:17:52 INFO mapred.JobClient:  map 100% reduce 0%
11/08/01 15:17:52 INFO mapred.JobClient: Job complete: job_local_0001
11/08/01 15:17:52 INFO mapred.JobClient: Counters: 5
11/08/01 15:17:52 INFO mapred.JobClient:   FileSystemCounters
11/08/01 15:17:52 INFO mapred.JobClient:     FILE_BYTES_READ=44872735
11/08/01 15:17:52 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=45245279
11/08/01 15:17:52 INFO mapred.JobClient:   Map-Reduce Framework
11/08/01 15:17:52 INFO mapred.JobClient:     Map input records=3
11/08/01 15:17:52 INFO mapred.JobClient:     Spilled Records=0
11/08/01 15:17:52 INFO mapred.JobClient:     Map output records=3
11/08/01 15:17:52 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
11/08/01 15:17:52 INFO crawl.InjectorJob: InjectorJob: finished



This is what was added to ivy/ivy.xml:

+       <dependency org="org.apache.gora" name="gora-cassandra" rev="0.2-incubating" conf="*->compile"/>
+       <dependency org="org.apache.cassandra" name="cassandra-thrift" rev="0.8.1"/>
+       <dependency org="com.ecyrd.speed4j" name="speed4j" rev="0.9" conf="*->*,!javadoc,!sources"/>
+       <dependency org="com.github.stephenc.high-scale-lib" name="high-scale-lib" rev="1.1.2" conf="*->*,!javadoc,!sources"/>
+       <dependency org="com.google.collections" name="google-collections" rev="1.0" conf="*->*,!javadoc,!sources"/>
+       <dependency org="com.google.guava" name="guava" rev="r09" conf="*->*,!javadoc,!sources"/>
+       <dependency org="org.apache.cassandra" name="apache-cassandra" rev="0.8.1"/>
+       <dependency org="me.prettyprint" name="hector-core" rev="0.8.0-2"/>
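
With this many dependencies pulled in by hand, it can help to check whether more than one jar provides the same class (for example org.apache.thrift.meta_data.FieldValueMetaData from the error earlier in this thread). The following is only a minimal sketch using plain JDK classes; JarClassFinder and its method names are made up for illustration and are not part of Nutch, Gora or Hadoop:

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipFile;

/** Hypothetical helper (not part of Nutch, Gora or Hadoop). */
final class JarClassFinder {

    /** Converts a class name such as "a.b.C" to its jar entry path "a/b/C.class". */
    static String entryNameFor(String className) {
        return className.replace('.', '/') + ".class";
    }

    /** Returns the jars directly under dir that contain the given class. */
    static List<File> jarsContaining(File dir, String className) {
        List<File> hits = new ArrayList<File>();
        File[] files = dir.listFiles();
        if (files == null) {
            return hits; // not a directory, or not readable
        }
        String entry = entryNameFor(className);
        for (File f : files) {
            if (!f.isFile() || !f.getName().endsWith(".jar")) {
                continue;
            }
            try {
                ZipFile zip = new ZipFile(f);
                try {
                    if (zip.getEntry(entry) != null) {
                        hits.add(f);
                    }
                } finally {
                    zip.close();
                }
            } catch (IOException e) {
                // Corrupt or unreadable archive: report it and keep scanning.
                System.err.println("Skipping unreadable jar " + f + ": " + e.getMessage());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        if (args.length < 2) {
            System.err.println("usage: java JarClassFinder <class-name> <jar-dir> [<jar-dir> ...]");
            return;
        }
        // Print every jar, in every given directory, that bundles the class.
        for (int i = 1; i < args.length; i++) {
            for (File jar : jarsContaining(new File(args[i]), args[0])) {
                System.out.println(jar.getPath());
            }
        }
    }
}
```

It could be pointed, for instance, at /usr/lib/hadoop-0.20/lib and at the lib directory of the unpacked job jar. If more than one jar is printed for the same class, the copy the class loader finds first is typically the one that gets used, which may not be the version you expect.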



On Mon, Aug 1, 2011 at 2:55 PM, Tom Davidson <td...@covario.com> wrote:
> I did something similar to below to add the Cassandra dependencies. Note that I am getting NoSuchMethodErrors not ClassNotFoundExceptions. Can you add the hector jars to your nutch job jar and see what you get? I think I am one step ahead of you. BTW, I just added this line to get the hector dependency:
>
>        <dependency org="me.prettyprint" name="hector-core" rev="0.8.0-2" conf="*->default"/>
>
> -----Original Message-----
> From: Alexis [mailto:alexis.detreglode@gmail.com]
> Sent: Monday, August 01, 2011 2:28 PM
> To: dev@nutch.apache.org
> Subject: Re: Nutch 2 and Cassandra
>
> Hi, libthrift is a dependency of cassandra-thrift, as listed here:
> http://mvnrepository.com/artifact/org.apache.cassandra/cassandra-thrift/0.8.1
>
> During Nutch build, you have to manually tweak the Ivy configuration depending on your choice of the Gora store, in this case Cassandra.
> Basically you need to add all the dependencies listed there:
> http://svn.apache.org/viewvc/incubator/gora/trunk/gora-cassandra/ivy/ivy.xml?view=markup
>
> Let's try to add to $NUTCH_HOME/ivy/ivy.xml the following dependencies and then let's rebuild Nutch (see attached patch):
>        <dependency org="org.apache.gora" name="gora-cassandra"
> rev="0.2-incubating" conf="*->compile"/>
>        <dependency org="org.apache.cassandra" name="cassandra-thrift" rev="0.8.1"/>
>        <dependency org="com.ecyrd.speed4j" name="speed4j" rev="0.9"
> conf="*->*,!javadoc,!sources"/>
>        <dependency org="com.github.stephenc.high-scale-lib"
> name="high-scale-lib" rev="1.1.2" conf="*->*,!javadoc,!sources"/>
>        <dependency org="com.google.collections" name="google-collections"
> rev="1.0" conf="*->*,!javadoc,!sources"/>
>        <dependency org="com.google.guava" name="guava" rev="r09"
> conf="*->*,!javadoc,!sources"/>
>
> $ ant clean
> $ ant
>
> In your case libthrift should now be downloaded by Ivy and then bundled into the nutch-2.0-dev.job file. I'm not sure how apache-cassandra and hector got included in your classpath...
>
> Somehow we need to resolve as well:
>        <dependency org="org.apache.cassandra" name="apache-cassandra"
> rev="0.8.1"/>
>        <dependency org="me.prettyprint" name="hector" rev="0.8.0-1"/>
>
> I don't think the following 2 jars are in the default maven repository so they won't be downloaded, that's why they were commented in the Gora Cassandra Ivy config (gora/trunk/gora-cassandra/ivy/ivy.xml)
>
>
> Since hector jar is not found in my case I get:
> ~/java/workspace/Nutch/trunk/runtime/deploy$ bin/nutch inject ~/java/workspace/Nutch/seeds
> 11/08/01 14:18:42 INFO crawl.InjectorJob: InjectorJob: starting
> 11/08/01 14:18:42 INFO crawl.InjectorJob: InjectorJob: urlDir:
> /home/alex/java/workspace/Nutch/seeds
> 11/08/01 14:18:42 INFO security.Groups: Group mapping impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping;
> cacheTimeout=300000
> 11/08/01 14:18:42 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
> 11/08/01 14:18:42 ERROR crawl.InjectorJob: InjectorJob:
> org.apache.gora.util.GoraException:
> java.lang.reflect.InvocationTargetException
>        at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:110)
>        at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:93)
>        at org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:59)
>        at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:243)
>        at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:268)
>        at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:282)
>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
>        at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:292)
>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:597)
>        at org.apache.hadoop.util.RunJar.main(RunJar.java:192)
> Caused by: java.lang.reflect.InvocationTargetException
>        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
>        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
>        at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>        at org.apache.gora.util.ReflectionUtils.newInstance(ReflectionUtils.java:76)
>        at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:102)
>        ... 12 more
> Caused by: java.lang.NoClassDefFoundError: me/prettyprint/hector/api/Serializer
>        at org.apache.gora.cassandra.store.CassandraStore.<init>(CassandraStore.java:60)
>        ... 18 more
> Caused by: java.lang.ClassNotFoundException:
> me.prettyprint.hector.api.Serializer
>        at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
>        at java.security.AccessController.doPrivileged(Native Method)
>        at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
>        at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
>        at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
>        ... 19 more
>
>
>
>
>

RE: Nutch 2 and Cassandra

Posted by Tom Davidson <td...@covario.com>.
I did something similar to what you describe below to add the Cassandra dependencies. Note that I am getting a NoSuchMethodError, not a ClassNotFoundException. Can you add the hector jars to your nutch job jar and see what you get? I think I am one step ahead of you. BTW, I just added this line to get the hector dependency:

        <dependency org="me.prettyprint" name="hector-core" rev="0.8.0-2" conf="*->default"/>
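
The descriptor in the NoSuchMethodError, FieldValueMetaData.<init>(BZ)V, is JVM notation for a constructor taking a byte and a boolean. One way to see which copy of a class the JVM actually picks up, and which constructors that copy has, is to ask the class itself. This is a minimal sketch, assuming it is run with the same classpath as the failing job; ClassOrigin and its helpers are hypothetical names, not part of Nutch or Thrift:

```java
import java.lang.reflect.Constructor;
import java.security.CodeSource;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical diagnostic helper (not part of Nutch or Thrift). */
final class ClassOrigin {

    /** Returns the jar or directory a class was loaded from, if the JVM knows it. */
    static String locationOf(Class<?> c) {
        CodeSource src = c.getProtectionDomain().getCodeSource();
        if (src == null || src.getLocation() == null) {
            return "(bootstrap or unknown)";
        }
        return src.getLocation().toString();
    }

    /** Returns the declared constructors of a class as readable signatures. */
    static List<String> constructorsOf(Class<?> c) {
        List<String> out = new ArrayList<String>();
        for (Constructor<?> k : c.getDeclaredConstructors()) {
            out.add(k.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        String name = args.length > 0
                ? args[0]
                : "org.apache.thrift.meta_data.FieldValueMetaData";
        try {
            // Load without running static initializers; we only inspect the class.
            Class<?> c = Class.forName(name, false, ClassOrigin.class.getClassLoader());
            System.out.println("loaded from: " + locationOf(c));
            for (String sig : constructorsOf(c)) {
                System.out.println("  " + sig);
            }
        } catch (ClassNotFoundException e) {
            System.err.println("Not on the classpath: " + name);
        }
    }
}
```

If the location printed is not the libthrift jar you expect, or no constructor taking (byte, boolean) is listed, that would suggest an older or different copy of the class is being found first.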

-----Original Message-----
From: Alexis [mailto:alexis.detreglode@gmail.com] 
Sent: Monday, August 01, 2011 2:28 PM
To: dev@nutch.apache.org
Subject: Re: Nutch 2 and Cassandra

Hi, libthrift is a dependency of cassandra-thrift, as listed here:
http://mvnrepository.com/artifact/org.apache.cassandra/cassandra-thrift/0.8.1

During the Nutch build, you have to manually tweak the Ivy configuration depending on your choice of Gora store, in this case Cassandra.
Basically, you need to add all the dependencies listed there:
http://svn.apache.org/viewvc/incubator/gora/trunk/gora-cassandra/ivy/ivy.xml?view=markup

Let's try adding the following dependencies to $NUTCH_HOME/ivy/ivy.xml and then rebuilding Nutch (see attached patch):
	<dependency org="org.apache.gora" name="gora-cassandra" rev="0.2-incubating" conf="*->compile"/>
	<dependency org="org.apache.cassandra" name="cassandra-thrift" rev="0.8.1"/>
	<dependency org="com.ecyrd.speed4j" name="speed4j" rev="0.9" conf="*->*,!javadoc,!sources"/>
	<dependency org="com.github.stephenc.high-scale-lib" name="high-scale-lib" rev="1.1.2" conf="*->*,!javadoc,!sources"/>
	<dependency org="com.google.collections" name="google-collections" rev="1.0" conf="*->*,!javadoc,!sources"/>
	<dependency org="com.google.guava" name="guava" rev="r09" conf="*->*,!javadoc,!sources"/>

$ ant clean
$ ant

In your case libthrift should now be downloaded by Ivy and then bundled into the nutch-2.0-dev.job file. I'm not sure how apache-cassandra and hector got included in your classpath...

Somehow we also need to resolve:
        <dependency org="org.apache.cassandra" name="apache-cassandra" rev="0.8.1"/>
        <dependency org="me.prettyprint" name="hector" rev="0.8.0-1"/>

I don't think these two jars are in the default Maven repository, so they won't be downloaded. That's why they were commented out in the Gora Cassandra Ivy config (gora/trunk/gora-cassandra/ivy/ivy.xml).


Since the hector jar is not found in my case, I get:
~/java/workspace/Nutch/trunk/runtime/deploy$ bin/nutch inject ~/java/workspace/Nutch/seeds
11/08/01 14:18:42 INFO crawl.InjectorJob: InjectorJob: starting
11/08/01 14:18:42 INFO crawl.InjectorJob: InjectorJob: urlDir: /home/alex/java/workspace/Nutch/seeds
11/08/01 14:18:42 INFO security.Groups: Group mapping impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping; cacheTimeout=300000
11/08/01 14:18:42 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
11/08/01 14:18:42 ERROR crawl.InjectorJob: InjectorJob: org.apache.gora.util.GoraException: java.lang.reflect.InvocationTargetException
        at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:110)
        at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:93)
        at org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:59)
        at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:243)
        at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:268)
        at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:282)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
        at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:292)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:192)
Caused by: java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
        at org.apache.gora.util.ReflectionUtils.newInstance(ReflectionUtils.java:76)
        at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:102)
        ... 12 more
Caused by: java.lang.NoClassDefFoundError: me/prettyprint/hector/api/Serializer
        at org.apache.gora.cassandra.store.CassandraStore.<init>(CassandraStore.java:60)
        ... 18 more
Caused by: java.lang.ClassNotFoundException: me.prettyprint.hector.api.Serializer
        at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
        ... 19 more




On Mon, Aug 1, 2011 at 11:59 AM, Tom Davidson <td...@covario.com> wrote:
> Hi All,
>
>
>
> I am kind of at my wit's end here, so I am hoping someone here can 
> help.  I am trying to use Nutch2 and Cassandra and I have been 
> successful using the runtime/local build. I am using the Cloudera CDH3 
> on CentOs 5 and I do not want to contaminate by hadoop install by 
> dropping in a bunch of Nutch jars, etc. So I am trying to use the 
> nutch-2-dev.job jar. When I try to use the nutch2-dev.job jar, I get 
> the error below.  I have double and triple checked the classpath and 
> the included jars and the only jar that contains FieldValueMetaData is 
> the libthrift-0.6.1.jar which has the method that is claimed to be missing. Any ideas?
>
>
>
> Thanks,
>
> Tom
>
>
>
>
>
>
>
>
>
> [tdavidson@nadevsan06 ~]$ bin/nutch inject urls
>
> /opt/jdk1.6.0_21/bin/java -Dproc_jar -Xmx1000m
> -Dhadoop.log.dir=/usr/lib/hadoop-0.20/logs -Dhadoop.log.file=hadoop.log
> -Dhadoop.home.dir=/usr/lib/hadoop-0.20 -Dhadoop.id.str=tdavidson
> -Dhadoop.root.logger=INFO,console
> -Djava.library.path=/usr/lib/hadoop-0.20/lib/native/Linux-amd64-64
> -Dhadoop.policy.file=hadoop-policy.xml -classpath
> /usr/lib/hadoop-0.20/conf:/opt/jdk1.6.0_21/lib/tools.jar:/usr/lib/hadoop-0.20:/usr/lib/hadoop-0.20/hadoop-core-0.20.2-cdh3u1.jar:/usr/lib/hadoop-0.20/lib/ant-contrib-1.0b3.jar:/usr/lib/hadoop-0.20/lib/aspectjrt-1.6.5.jar:/usr/lib/hadoop-0.20/lib/aspectjtools-1.6.5.jar:/usr/lib/hadoop-0.20/lib/commons-cli-1.2.jar:/usr/lib/hadoop-0.20/lib/commons-codec-1.4.jar:/usr/lib/hadoop-0.20/lib/commons-daemon-1.0.1.jar:/usr/lib/hadoop-0.20/lib/commons-el-1.0.jar:/usr/lib/hadoop-0.20/lib/commons-httpclient-3.0.1.jar:/usr/lib/hadoop-0.20/lib/commons-logging-1.0.4.jar:/usr/lib/hadoop-0.20/lib/commons-logging-api-1.0.4.jar:/usr/lib/hadoop-0.20/lib/commons-net-1.4.1.jar:/usr/lib/hadoop-0.20/lib/core-3.1.1.jar:/usr/lib/hadoop-0.20/lib/hadoop-fairscheduler-0.20.2-cdh3u1.jar:/usr/lib/hadoop-0.20/lib/hsqldb-1.8.0.10.jar:/usr/lib/hadoop-0.20/lib/hue-plugins-1.2.0-cdh3u1.jar:/usr/lib/hadoop-0.20/lib/jackson-core-asl-1.5.2.jar:/usr/lib/hadoop-0.20/lib/jackson-mapper-asl-1.5.2.jar:/usr/lib/hadoop-0.20/lib/jasper-compiler-5.5.12.jar:/usr/lib/hadoop-0.20/lib/jasper-runtime-5.5.12.jar:/usr/lib/hadoop-0.20/lib/jets3t-0.6.1.jar:/usr/lib/hadoop-0.20/lib/jetty-6.1.26.jar:/usr/lib/hadoop-0.20/lib/jetty-servlet-tester-6.1.26.jar:/usr/lib/hadoop-0.20/lib/jetty-util-6.1.26.jar:/usr/lib/hadoop-0.20/lib/jsch-0.1.42.jar:/usr/lib/hadoop-0.20/lib/junit-4.5.jar:/usr/lib/hadoop-0.20/lib/kfs-0.2.2.jar:/usr/lib/hadoop-0.20/lib/log4j-1.2.15.jar:/usr/lib/hadoop-0.20/lib/mockito-all-1.8.2.jar:/usr/lib/hadoop-0.20/lib/oro-2.0.8.jar:/usr/lib/hadoop-0.20/lib/servlet-api-2.5-20081211.jar:/usr/lib/hadoop-0.20/lib/servlet-api-2.5-6.1.14.jar:/usr/lib/hadoop-0.20/lib/slf4j-api-1.4.3.jar:/usr/lib/hadoop-0.20/lib/slf4j-log4j12-1.4.3.jar:/usr/lib/hadoop-0.20/lib/xmlenc-0.52.jar:/usr/lib/hadoop-0.20/lib/jsp-2.1/jsp-2.1.jar:/usr/lib/hadoop-0.20/lib/jsp-2.1/jsp-api-2.1.jar
> org.apache.hadoop.util.RunJar /home/SEMDIRECTOR/tdavidson/nutch-2.job
> org.apache.nutch.crawl.InjectorJob urls
>
> 11/08/01 11:51:54 INFO crawl.InjectorJob: InjectorJob: starting
>
> 11/08/01 11:51:54 INFO crawl.InjectorJob: InjectorJob: urlDir: urls
>
> 11/08/01 11:51:55 INFO connection.CassandraHostRetryService: Downed 
> Host Retry service started with queue size -1 and retry delay 10s
>
> 11/08/01 11:51:55 INFO service.JmxMonitor: Registering JMX 
> me.prettyprint.cassandra.service_Test
> Cluster:ServiceType=hector,MonitorType=hector
>
> 11/08/01 11:51:55 ERROR crawl.InjectorJob: InjectorJob: org.apache.gora.util.GoraException: java.lang.reflect.InvocationTargetException
>         at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:110)
>         at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:93)
>         at org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:59)
>         at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:243)
>         at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:268)
>         at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:282)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>         at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:292)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
> Caused by: java.lang.reflect.InvocationTargetException
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
>         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
>         at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>         at org.apache.gora.util.ReflectionUtils.newInstance(ReflectionUtils.java:76)
>         at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:102)
>         ... 12 more
> Caused by: java.lang.NoSuchMethodError: org.apache.thrift.meta_data.FieldValueMetaData.<init>(BZ)V
>         at org.apache.cassandra.thrift.CfDef.<clinit>(CfDef.java:299)
>         at org.apache.cassandra.thrift.KsDef.read(KsDef.java:753)
>         at org.apache.cassandra.thrift.Cassandra$describe_keyspace_result.read(Cassandra.java:24338)
>         at org.apache.cassandra.thrift.Cassandra$Client.recv_describe_keyspace(Cassandra.java:1371)
>         at org.apache.cassandra.thrift.Cassandra$Client.describe_keyspace(Cassandra.java:1346)
>         at me.prettyprint.cassandra.service.AbstractCluster$4.execute(AbstractCluster.java:192)
>         at me.prettyprint.cassandra.service.AbstractCluster$4.execute(AbstractCluster.java:187)
>         at me.prettyprint.cassandra.service.Operation.executeAndSetResult(Operation.java:101)
>         at me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:232)
>         at me.prettyprint.cassandra.service.AbstractCluster.describeKeyspace(AbstractCluster.java:201)
>         at org.apache.gora.cassandra.store.CassandraClient.checkKeyspace(CassandraClient.java:82)
>         at org.apache.gora.cassandra.store.CassandraClient.init(CassandraClient.java:69)
>         at org.apache.gora.cassandra.store.CassandraStore.<init>(CassandraStore.java:68)
>         ... 18 more

Re: Nutch 2 and Cassandra

Posted by Alexis <al...@gmail.com>.
Hi, libthrift is a dependency of cassandra-thrift, as listed here:
http://mvnrepository.com/artifact/org.apache.cassandra/cassandra-thrift/0.8.1

During the Nutch build, you have to tweak the Ivy configuration by hand
to match your choice of Gora store, in this case Cassandra.
Basically, you need to add all the dependencies listed here:
http://svn.apache.org/viewvc/incubator/gora/trunk/gora-cassandra/ivy/ivy.xml?view=markup

Try adding the following dependencies to $NUTCH_HOME/ivy/ivy.xml, then
rebuild Nutch (see attached patch):
	<dependency org="org.apache.gora" name="gora-cassandra"
		rev="0.2-incubating" conf="*->compile"/>
	<dependency org="org.apache.cassandra" name="cassandra-thrift"
		rev="0.8.1"/>
	<dependency org="com.ecyrd.speed4j" name="speed4j"
		rev="0.9" conf="*->*,!javadoc,!sources"/>
	<dependency org="com.github.stephenc.high-scale-lib" name="high-scale-lib"
		rev="1.1.2" conf="*->*,!javadoc,!sources"/>
	<dependency org="com.google.collections" name="google-collections"
		rev="1.0" conf="*->*,!javadoc,!sources"/>
	<dependency org="com.google.guava" name="guava"
		rev="r09" conf="*->*,!javadoc,!sources"/>

$ ant clean
$ ant

In your case libthrift should now be downloaded by Ivy and then
bundled into the nutch-2.0-dev.job file. I'm not sure how
apache-cassandra and hector got included in your classpath...
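
A NoSuchMethodError like the one above generally means that the class
loaded at run time is not the version the caller was compiled against.
The descriptor <init>(BZ)V denotes a constructor taking (byte, boolean).
A minimal sketch of how one might check which jar FieldValueMetaData is
really loaded from, and whether that constructor exists, is shown below.
ClasspathProbe and its helper names are hypothetical, and the probe is
only meaningful if it runs with the same classpath as the failing job:

```java
import java.security.CodeSource;

public class ClasspathProbe {

    /** Returns the jar or directory a class was loaded from, or a note if unknown. */
    static String locationOf(Class<?> cls) {
        CodeSource src = cls.getProtectionDomain().getCodeSource();
        if (src == null || src.getLocation() == null) {
            return "(bootstrap or unknown)";
        }
        return src.getLocation().toString();
    }

    /** True if the class declares a constructor with exactly these parameter types. */
    static boolean hasConstructor(Class<?> cls, Class<?>... paramTypes) {
        try {
            cls.getDeclaredConstructor(paramTypes);
            return true;
        } catch (NoSuchMethodException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // Default to the class named in the stack trace; another name may be passed in.
        String name = args.length > 0
                ? args[0]
                : "org.apache.thrift.meta_data.FieldValueMetaData";
        Class<?> cls = Class.forName(name);
        System.out.println(name + " loaded from " + locationOf(cls));
        // (BZ)V in the error message corresponds to (byte, boolean).
        System.out.println("has (byte, boolean) constructor: "
                + hasConstructor(cls, byte.class, boolean.class));
    }
}
```

If the reported location is not the libthrift jar you expect, another
copy of the class earlier on the classpath is the likely culprit.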

We also need to resolve these two somehow:
        <dependency org="org.apache.cassandra" name="apache-cassandra"
rev="0.8.1"/>
        <dependency org="me.prettyprint" name="hector" rev="0.8.0-1"/>

I don't think these two jars are in the default Maven repository, so
they won't be downloaded; that is why they were commented out in the
Gora Cassandra Ivy config (gora/trunk/gora-cassandra/ivy/ivy.xml).


Since the hector jar is not found in my case, I get:
~/java/workspace/Nutch/trunk/runtime/deploy$ bin/nutch inject ~/java/workspace/Nutch/seeds
11/08/01 14:18:42 INFO crawl.InjectorJob: InjectorJob: starting
11/08/01 14:18:42 INFO crawl.InjectorJob: InjectorJob: urlDir: /home/alex/java/workspace/Nutch/seeds
11/08/01 14:18:42 INFO security.Groups: Group mapping impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping; cacheTimeout=300000
11/08/01 14:18:42 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
11/08/01 14:18:42 ERROR crawl.InjectorJob: InjectorJob: org.apache.gora.util.GoraException: java.lang.reflect.InvocationTargetException
        at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:110)
        at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:93)
        at org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:59)
        at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:243)
        at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:268)
        at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:282)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
        at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:292)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:192)
Caused by: java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
        at org.apache.gora.util.ReflectionUtils.newInstance(ReflectionUtils.java:76)
        at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:102)
        ... 12 more
Caused by: java.lang.NoClassDefFoundError: me/prettyprint/hector/api/Serializer
        at org.apache.gora.cassandra.store.CassandraStore.<init>(CassandraStore.java:60)
        ... 18 more
Caused by: java.lang.ClassNotFoundException: me.prettyprint.hector.api.Serializer
        at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
        ... 19 more




On Mon, Aug 1, 2011 at 11:59 AM, Tom Davidson <td...@covario.com> wrote:
> Hi All,
>
>
>
> I am kind of at my wit's end here, so I am hoping someone here can help. I
> am trying to use Nutch2 and Cassandra and I have been successful using the
> runtime/local build. I am using the Cloudera CDH3 on CentOs 5 and I do not
> want to contaminate by hadoop install by dropping in a bunch of Nutch jars,
> etc. So I am trying to use the nutch-2-dev.job jar. When I try to use the
> nutch2-dev.job jar, I get the error below. I have double and triple checked
> the classpath and the included jars and the only jar that contains
> FieldValueMetaData is the libthrift-0.6.1.jar which has the method that is
> claimed to be missing. Any ideas?
>
>
>
> Thanks,
>
> Tom
>
>
>
>
>
>
>
>
>
> [tdavidson@nadevsan06 ~]$ bin/nutch inject urls
>
> /opt/jdk1.6.0_21/bin/java -Dproc_jar -Xmx1000m
> -Dhadoop.log.dir=/usr/lib/hadoop-0.20/logs -Dhadoop.log.file=hadoop.log
> -Dhadoop.home.dir=/usr/lib/hadoop-0.20 -Dhadoop.id.str=tdavidson
> -Dhadoop.root.logger=INFO,console
> -Djava.library.path=/usr/lib/hadoop-0.20/lib/native/Linux-amd64-64
> -Dhadoop.policy.file=hadoop-policy.xml -classpath
> /usr/lib/hadoop-0.20/conf:/opt/jdk1.6.0_21/lib/tools.jar:/usr/lib/hadoop-0.20:/usr/lib/hadoop-0.20/hadoop-core-0.20.2-cdh3u1.jar:/usr/lib/hadoop-0.20/lib/ant-contrib-1.0b3.jar:/usr/lib/hadoop-0.20/lib/aspectjrt-1.6.5.jar:/usr/lib/hadoop-0.20/lib/aspectjtools-1.6.5.jar:/usr/lib/hadoop-0.20/lib/commons-cli-1.2.jar:/usr/lib/hadoop-0.20/lib/commons-codec-1.4.jar:/usr/lib/hadoop-0.20/lib/commons-daemon-1.0.1.jar:/usr/lib/hadoop-0.20/lib/commons-el-1.0.jar:/usr/lib/hadoop-0.20/lib/commons-httpclient-3.0.1.jar:/usr/lib/hadoop-0.20/lib/commons-logging-1.0.4.jar:/usr/lib/hadoop-0.20/lib/commons-logging-api-1.0.4.jar:/usr/lib/hadoop-0.20/lib/commons-net-1.4.1.jar:/usr/lib/hadoop-0.20/lib/core-3.1.1.jar:/usr/lib/hadoop-0.20/lib/hadoop-fairscheduler-0.20.2-cdh3u1.jar:/usr/lib/hadoop-0.20/lib/hsqldb-1.8.0.10.jar:/usr/lib/hadoop-0.20/lib/hue-plugins-1.2.0-cdh3u1.jar:/usr/lib/hadoop-0.20/lib/jackson-core-asl-1.5.2.jar:/usr/lib/hadoop-0.20/lib/jackson-mapper-asl-1.5.2.jar:/usr/lib/hadoop-0.20/lib/jasper-compiler-5.5.12.jar:/usr/lib/hadoop-0.20/lib/jasper-runtime-5.5.12.jar:/usr/lib/hadoop-0.20/lib/jets3t-0.6.1.jar:/usr/lib/hadoop-0.20/lib/jetty-6.1.26.jar:/usr/lib/hadoop-0.20/lib/jetty-servlet-tester-6.1.26.jar:/usr/lib/hadoop-0.20/lib/jetty-util-6.1.26.jar:/usr/lib/hadoop-0.20/lib/jsch-0.1.42.jar:/usr/lib/hadoop-0.20/lib/junit-4.5.jar:/usr/lib/hadoop-0.20/lib/kfs-0.2.2.jar:/usr/lib/hadoop-0.20/lib/log4j-1.2.15.jar:/usr/lib/hadoop-0.20/lib/mockito-all-1.8.2.jar:/usr/lib/hadoop-0.20/lib/oro-2.0.8.jar:/usr/lib/hadoop-0.20/lib/servlet-api-2.5-20081211.jar:/usr/lib/hadoop-0.20/lib/servlet-api-2.5-6.1.14.jar:/usr/lib/hadoop-0.20/lib/slf4j-api-1.4.3.jar:/usr/lib/hadoop-0.20/lib/slf4j-log4j12-1.4.3.jar:/usr/lib/hadoop-0.20/lib/xmlenc-0.52.jar:/usr/lib/hadoop-0.20/lib/jsp-2.1/jsp-2.1.jar:/usr/lib/hadoop-0.20/lib/jsp-2.1/jsp-api-2.1.jar
> org.apache.hadoop.util.RunJar /home/SEMDIRECTOR/tdavidson/nutch-2.job
> org.apache.nutch.crawl.InjectorJob urls
>
> 11/08/01 11:51:54 INFO crawl.InjectorJob: InjectorJob: starting
>
> 11/08/01 11:51:54 INFO crawl.InjectorJob: InjectorJob: urlDir: urls
>
> 11/08/01 11:51:55 INFO connection.CassandraHostRetryService: Downed Host
> Retry service started with queue size -1 and retry delay 10s
>
> 11/08/01 11:51:55 INFO service.JmxMonitor: Registering JMX
> me.prettyprint.cassandra.service_Test
> Cluster:ServiceType=hector,MonitorType=hector
>
> 11/08/01 11:51:55 ERROR crawl.InjectorJob: InjectorJob: org.apache.gora.util.GoraException: java.lang.reflect.InvocationTargetException
>         at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:110)
>         at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:93)
>         at org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:59)
>         at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:243)
>         at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:268)
>         at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:282)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>         at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:292)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
> Caused by: java.lang.reflect.InvocationTargetException
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
>         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
>         at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>         at org.apache.gora.util.ReflectionUtils.newInstance(ReflectionUtils.java:76)
>         at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:102)
>         ... 12 more
> Caused by: java.lang.NoSuchMethodError: org.apache.thrift.meta_data.FieldValueMetaData.<init>(BZ)V
>         at org.apache.cassandra.thrift.CfDef.<clinit>(CfDef.java:299)
>         at org.apache.cassandra.thrift.KsDef.read(KsDef.java:753)
>         at org.apache.cassandra.thrift.Cassandra$describe_keyspace_result.read(Cassandra.java:24338)
>         at org.apache.cassandra.thrift.Cassandra$Client.recv_describe_keyspace(Cassandra.java:1371)
>         at org.apache.cassandra.thrift.Cassandra$Client.describe_keyspace(Cassandra.java:1346)
>         at me.prettyprint.cassandra.service.AbstractCluster$4.execute(AbstractCluster.java:192)
>         at me.prettyprint.cassandra.service.AbstractCluster$4.execute(AbstractCluster.java:187)
>         at me.prettyprint.cassandra.service.Operation.executeAndSetResult(Operation.java:101)
>         at me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:232)
>         at me.prettyprint.cassandra.service.AbstractCluster.describeKeyspace(AbstractCluster.java:201)
>         at org.apache.gora.cassandra.store.CassandraClient.checkKeyspace(CassandraClient.java:82)
>         at org.apache.gora.cassandra.store.CassandraClient.init(CassandraClient.java:69)
>         at org.apache.gora.cassandra.store.CassandraStore.<init>(CassandraStore.java:68)
>         ... 18 more