Posted to user@nutch.apache.org by Bowen Masco <bo...@codingfoo.com> on 2012/01/13 00:33:28 UTC

nutch, oozie and elasticsearch

So I recently did some work adding Elasticsearch indexing to
Nutch and creating workflows to run Nutch tasks with Oozie.

Our fork is on GitHub:

https://github.com/tatemae/nutch/tree/esindex

The code is on the esindex branch.

So far I am able to run Nutch tasks from Oozie and index content into
Elasticsearch.

I added a utility class to Nutch and a workflow for Oozie that allow a URL
to be programmatically added to a urls folder for later injection.
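
The idea of queuing a URL for later injection can be sketched roughly as
follows. This is not the utility class from the fork (which is Java); it is a
minimal Python illustration that assumes a local urls directory and one seed
file per URL. The name add_seed_url and the file naming scheme are
hypothetical.

```python
import os
import uuid


def add_seed_url(urls_dir, url):
    """Write one URL to a new seed file inside urls_dir and return its path.

    Hypothetical helper: the name, the one-file-per-URL layout and the local
    directory are assumptions for illustration, not the code from the fork.
    """
    url = url.strip()
    if not url.startswith(("http://", "https://")):
        raise ValueError("expected an http(s) URL, got: %r" % url)
    os.makedirs(urls_dir, exist_ok=True)
    # A unique file name avoids clobbering seeds queued by other callers.
    seed_path = os.path.join(urls_dir, "seed-%s.txt" % uuid.uuid4().hex)
    with open(seed_path, "w", encoding="utf-8") as handle:
        # The Nutch injector reads plain-text seed files, one URL per line.
        handle.write(url + "\n")
    return seed_path
```

On a cluster the seed file would still have to be copied to HDFS before the
inject step runs; that part is left out here.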

Things are a little messy; it was kind of experimental.

I will write a more detailed blog post in the future.

Any comments or questions are welcome.

Bowen

Re: nutch, oozie and elasticsearch

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Bowen,

Something that popped into my head was Julien's ticket on our Jira; please
see [1]. Julien's description of the issue should give enough insight into
where this work could ideally go.

Thanks

Lewis

[1] https://issues.apache.org/jira/browse/NUTCH-1047

On Fri, Jan 13, 2012 at 12:19 PM, Lewis John Mcgibbney <
lewis.mcgibbney@gmail.com> wrote:

> Hi Bowen,
>
> I completely agree with Chris' comments. A few people have popped up from
> time to time asking about ES, so any contribution in this area would be
> excellent.
>
> In the meantime I'll check your code out on GitHub.
>
> Thanks for keeping us in the loop.
>
> Lewis
>
>
> On Fri, Jan 13, 2012 at 5:57 AM, Bowen Masco <bo...@codingfoo.com> wrote:
>
>> Shouldn't be too hard to port. In fact my plan was to get this going
>> in production to shake out any bugs then work it into a series of
>> patches. There are some small things that are particular to how we are
>> using Nutch that would not be relevant to everyone else, that I plan
>> on removing. I will look at doing the initial work in the next few
>> days.
>>
>> We are definitely interested in, and planning on, contributing the code
>> to Apache. I have already obtained permission to do this; it is just a
>> matter of working out the details. I will message you off list about this.
>>
>> Bowen
>>

-- 
*Lewis*

Re: nutch, oozie and elasticsearch

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Bowen,

I completely agree with Chris' comments. A few people have popped up from
time to time asking about ES, so any contribution in this area would be
excellent.

In the meantime I'll check your code out on GitHub.

Thanks for keeping us in the loop.

Lewis

On Fri, Jan 13, 2012 at 5:57 AM, Bowen Masco <bo...@codingfoo.com> wrote:

> Shouldn't be too hard to port. In fact my plan was to get this going
> in production to shake out any bugs then work it into a series of
> patches. There are some small things that are particular to how we are
> using Nutch that would not be relevant to everyone else, that I plan
> on removing. I will look at doing the initial work in the next few
> days.
>
> We are definitely interested in, and planning on, contributing the code
> to Apache. I have already obtained permission to do this; it is just a
> matter of working out the details. I will message you off list about this.
>
> Bowen
>



-- 
*Lewis*

Re: nutch, oozie and elasticsearch

Posted by Bowen Masco <bo...@codingfoo.com>.
Shouldn't be too hard to port. In fact my plan was to get this going
in production to shake out any bugs then work it into a series of
patches. There are some small things that are particular to how we are
using Nutch that would not be relevant to everyone else, that I plan
on removing. I will look at doing the initial work in the next few
days.

We are definitely interested in, and planning on, contributing the code
to Apache. I have already obtained permission to do this; it is just a
matter of working out the details. I will message you off list about this.

Bowen

Re: nutch, oozie and elasticsearch

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
Hey Bowen,

This is REALLY awesome! 

How hard do you think it would be to port your github.com
contribution on top of the existing Nutch trunk as a series
of JIRA issues and patches? I'd be happy to help out here.

Are you interested in contributing the code to Apache? If so,
do you think you would be amenable to signing an Individual
Contributor License Agreement (ICLA) [1] for the ASF?

Thanks and really really awesome here.

Cheers,
Chris

[1] http://www.apache.org/licenses/icla.txt

On Jan 12, 2012, at 3:33 PM, Bowen Masco wrote:

> So I recently did some work adding Elasticsearch indexing to
> Nutch and creating workflows to run Nutch tasks with Oozie.
>
> Our fork is on GitHub:
> 
> https://github.com/tatemae/nutch/tree/esindex
> 
> The code is on the esindex branch.
> 
> So far I am able to run Nutch tasks from Oozie and index content into
> Elasticsearch.
>
> I added a utility class to Nutch and a workflow for Oozie that allow a URL
> to be programmatically added to a urls folder for later injection.
>
> Things are a little messy; it was kind of experimental.
> 
> I will write a more detailed blog post in the future.
> 
> Any comments or questions are welcome.
> 
> Bowen


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattmann@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


Re: nutch, oozie and elasticsearch

Posted by anupamk <an...@usc.edu>.
Hi Talat,

Good news! I got the job working.

The problem was that I had not renamed the .job file to a .jar file.

I spent about 8 hours troubleshooting this stupid mistake!

I have never been gladder to have re-read your post more carefully!



--
View this message in context: http://lucene.472066.n3.nabble.com/nutch-oozie-and-elasticsearch-tp3655313p4131525.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: nutch, oozie and elasticsearch

Posted by anupamk <an...@usc.edu>.
Hi Talat,

Eagerly awaiting the documentation. Please let me know when you have it out. 

thanks.

Anupam



--
View this message in context: http://lucene.472066.n3.nabble.com/nutch-oozie-and-elasticsearch-tp3655313p4131366.html

Re: nutch, oozie and elasticsearch

Posted by Talat Uyarer <ta...@uyarer.com>.
Hi anupamk,

Today I started writing documentation for using Oozie with Nutch. I will share it.

Thanks

2014-04-15 2:59 GMT+03:00 anupamk <an...@usc.edu>:
> Hi Talat,
>
> I am facing the same ClassNotFoundException. I am putting the required JAR
> files in the lib directory of my workflow. However, when I submit the job,
> the contents of the lib directory get replaced with just the .job
> file.
>
> I am using apache-nutch-1.7



-- 
Talat UYARER
Website: http://talat.uyarer.com
Twitter: http://twitter.com/talatuyarer
Linkedin: http://tr.linkedin.com/pub/talat-uyarer/10/142/304

Re: nutch, oozie and elasticsearch

Posted by anupamk <an...@usc.edu>.
Hi Talat,

I am facing the same ClassNotFoundException. I am putting the required JAR
files in the lib directory of my workflow. However, when I submit the job,
the contents of the lib directory get replaced with just the .job
file.

I am using apache-nutch-1.7



--
View this message in context: http://lucene.472066.n3.nabble.com/nutch-oozie-and-elasticsearch-tp3655313p4131102.html

Re: nutch, oozie and elasticsearch

Posted by Talat UYARER <ta...@agmlab.com>.
Hi vivekvl,

You should upload the Nutch library jars to the working directory in HDFS
where your workflow file is stored, in a lib directory.

That should solve it.

Talat

On 28-10-2013 at 14:10, vivekvl wrote:
> Hi Talat,
>
> Thanks for your tips. With the Java workflow action I am able to submit the
> Nutch job to Hadoop.
>
> I am facing some classpath configuration issues in the submitted MapReduce
> job.
>
> I keep getting the exception below:
>
> java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.gora.mapreduce.GoraOutputFormat
>         at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:809)
>         at org.apache.hadoop.mapreduce.JobContext.getOutputFormatClass(JobContext.java:197)
> Caused by: java.lang.ClassNotFoundException: org.apache.gora.mapreduce.GoraOutputFormat
>
> I have set oozie.use.system.libpath=true in job.properties and made sure the
> jars are available in the configured HDFS location. I also tried putting all
> the jars in the application's lib directory. Nothing helps.
>
> I am stuck on the Oozie classpath configuration. Can you please share your
> classpath configuration details (jars and Nutch configuration)?
>
> Environment details: oozie-3.0.0, Nutch 2.1 with hadoop-0.20.3 and
> hbase-0.90.6.
>
> Thanks,
> vivek


Re: nutch, oozie and elasticsearch

Posted by vivekvl <vi...@yahoo.com>.
Hi Talat,

Thanks for your tips. With the Java workflow action I am able to submit the
Nutch job to Hadoop.

I am facing some classpath configuration issues in the submitted MapReduce
job.

I keep getting the exception below:

java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.gora.mapreduce.GoraOutputFormat
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:809)
        at org.apache.hadoop.mapreduce.JobContext.getOutputFormatClass(JobContext.java:197)
Caused by: java.lang.ClassNotFoundException: org.apache.gora.mapreduce.GoraOutputFormat

I have set oozie.use.system.libpath=true in job.properties and made sure the
jars are available in the configured HDFS location. I also tried putting all
the jars in the application's lib directory. Nothing helps.

I am stuck on the Oozie classpath configuration. Can you please share your
classpath configuration details (jars and Nutch configuration)?

Environment details: oozie-3.0.0, Nutch 2.1 with hadoop-0.20.3 and
hbase-0.90.6.

Thanks,
vivek



--
View this message in context: http://lucene.472066.n3.nabble.com/nutch-oozie-and-elasticsearch-tp3655313p4098054.html

Re: nutch, oozie and elasticsearch

Posted by Talat UYARER <ta...@agmlab.com>.
You are welcome, vivekvl.

I will prepare a document on Oozie usage. Basically, you should upload
your dependencies to the lib directory under the working directory in HDFS.
After that, you should upload your job file as a jar file.
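
As a rough illustration of that layout, the Python sketch below maps local
files to their HDFS destinations. It is only a sketch: plan_hdfs_layout, the
application path and the file names are hypothetical, and placing the renamed
job file in lib/ next to the other jars is an assumption, since the message
does not say where it should go.

```python
import posixpath


def plan_hdfs_layout(app_dir, job_file, dependency_jars):
    """Map local files to HDFS destinations for an Oozie workflow application.

    Hypothetical helper; app_dir and the file names are placeholders.
    """
    lib_dir = posixpath.join(app_dir, "lib")
    layout = {"workflow.xml": posixpath.join(app_dir, "workflow.xml")}
    for jar in dependency_jars:
        layout[jar] = posixpath.join(lib_dir, posixpath.basename(jar))
    # Assumption: the Nutch .job archive is uploaded under a .jar name so it
    # sits in lib/ together with the other jars.
    job_name = posixpath.basename(job_file)
    stem, ext = posixpath.splitext(job_name)
    jar_name = stem + ".jar" if ext == ".job" else job_name
    layout[job_file] = posixpath.join(lib_dir, jar_name)
    return layout
```

Each local/HDFS pair in the returned mapping could then be copied with
"hadoop fs -put <local> <hdfs>".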

When you create your workflow.xml, you should create a java action for
every job (Inject, Generate, Fetch, etc.). Don't forget to put every
argument token in its own element, one per line. For -topN 1000:
<arg>-topN</arg>
<arg>1000</arg>

If you do this, it will work. Create your workflow.xml; if you have a
problem, we can talk again. I will write the document ASAP.
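
To make the one-token-per-element rule concrete, here is a small Python
sketch that builds a java action and splits a command line into separate
<arg> elements (the element name used by the Oozie java action schema). The
helper name java_action, the transition targets, the ${jobTracker} and
${nameNode} parameters and the crawl paths are assumptions for illustration;
org.apache.nutch.crawl.Generator is the Nutch 1.x generator class, so adjust
it for other versions. The workflow XML namespace is left out for brevity.

```python
import shlex
import xml.etree.ElementTree as ET


def java_action(name, main_class, command_line, ok_to, error_to):
    """Build an Oozie <action> element that wraps a <java> action.

    Each whitespace-separated token of command_line becomes its own <arg>.
    """
    action = ET.Element("action", {"name": name})
    java = ET.SubElement(action, "java")
    ET.SubElement(java, "job-tracker").text = "${jobTracker}"
    ET.SubElement(java, "name-node").text = "${nameNode}"
    ET.SubElement(java, "main-class").text = main_class
    for token in shlex.split(command_line):
        ET.SubElement(java, "arg").text = token
    ET.SubElement(action, "ok", {"to": ok_to})
    ET.SubElement(action, "error", {"to": error_to})
    return action


# Placeholder values: one action for the Generate step.
generate = java_action(
    "generate",
    "org.apache.nutch.crawl.Generator",
    "crawl/crawldb crawl/segments -topN 1000",
    ok_to="fetch",
    error_to="fail",
)
print(ET.tostring(generate, encoding="unicode"))
```

The same helper would be called once per step (Inject, Generate, Fetch and so
on), chaining the steps through the ok transitions.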

Talat

On 22-10-2013 at 15:55, vivekvl wrote:
> Thanks Talat,
>
> I am trying to configure an Oozie workflow.xml for executing a Nutch job.
>
> Do I need some customization for launching the Nutch job, e.g. creating a
> main class by extending org.apache.oozie.action.hadoop.MapReduceMain to
> accommodate the Nutch configuration?
>
> I would also like to know the best approach for configuring the Nutch
> libraries in Oozie.
> -Vivek


Re: nutch, oozie and elasticsearch

Posted by vivekvl <vi...@yahoo.com>.
Thanks Talat,

I am trying to configure an Oozie workflow.xml for executing a Nutch job.

Do I need some customization for launching the Nutch job, e.g. creating a
main class by extending org.apache.oozie.action.hadoop.MapReduceMain to
accommodate the Nutch configuration?

I would also like to know the best approach for configuring the Nutch
libraries in Oozie.
-Vivek



--
View this message in context: http://lucene.472066.n3.nabble.com/nutch-oozie-and-elasticsearch-tp3655313p4097021.html

Re: nutch, oozie and elasticsearch

Posted by Talat UYARER <ta...@agmlab.com>.
Hi,
You can ask about your problem. :) I use Nutch in Oozie with Hue.


On 22-10-2013 at 14:37, vivekvl wrote:
> Hi Bowen,
>
> Glad to know you have already made this.
>
> I am trying to configure a Nutch job in Oozie and am facing some difficulties.
>
> Is it possible to make the [/nutch/tree/esindex] branch available again? It is
> not accessible now.
>
> Thanks,
> Vivek


Re: nutch, oozie and elasticsearch

Posted by vivekvl <vi...@yahoo.com>.
Hi Bowen,

Glad to know you have already made this.

I am trying to configure a Nutch job in Oozie and am facing some difficulties.

Is it possible to make the [/nutch/tree/esindex] branch available again? It is
not accessible now.

Thanks,
Vivek





--
View this message in context: http://lucene.472066.n3.nabble.com/nutch-oozie-and-elasticsearch-tp3655313p4097004.html