Posted to user@nutch.apache.org by Chris Anderson <jc...@grabb.it> on 2008/06/10 01:55:18 UTC

Streaming.jar for Nutch?

We're planning to run some Ruby parsers on the fetched content from a
Nutch crawl. It seems like the best way to do this would be through an
interface like Hadoop's streaming.jar, but streaming.jar expects a
line-based input format.

Has anyone written a version of streaming.jar for Nutch? We're working
on one, so if you'd like to collaborate (or have any advice), please
reply!

Thanks,
Chris

-- 
Chris Anderson
http://jchris.mfdz.com

Re: Streaming.jar for Nutch?

Posted by Lincoln Ritter <li...@lincolnritter.com>.
Hi again.

Has there been any progress on this front?  I'm itching for a clean
way to use Ruby for our analysis tasks.  I've been attempting to use
JRuby for this, but without much success so far.  Our primary concern
right now is parsing pages fetched with Nutch into an intermediate
representation (likely JSON) and then running various analysis
tasks over the generated JSON.

By way of background, my thinking on the JRuby front has been to
create a Java "harness" for MapReduce tasks that then dispatches to
Ruby code to do the actual work.  This is turning out to be quite
cumbersome.

Thanks for your thoughts and solutions!

-lincoln

--
lincolnritter.com



On Wed, Jun 11, 2008 at 3:37 PM, Michael Gottesman <go...@reed.edu> wrote:
> I am working on a solution at the moment that writes out the data in JSON
> format during the Fetcher phase, while the fetch itself is running. This is
> the only time the CrawlDatum and Content data are together, so it makes
> sense if you are only interested in the actual website content and the
> metadata/fetch status (provided by the CrawlDatum).
>
> If you are interested, I will post it in a pastie later tonight.

Re: Streaming.jar for Nutch?

Posted by Michael Gottesman <go...@reed.edu>.
I am working on a solution at the moment that writes out the data in
JSON format during the Fetcher phase, while the fetch itself is running.
This is the only time the CrawlDatum and Content data are together, so
it makes sense if you are only interested in the actual website content
and the metadata/fetch status (provided by the CrawlDatum).

If you are interested, I will post it in a pastie later tonight.
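
As a rough illustration of the "write it out as JSON" idea (not the code
from the pastie, which is not part of this thread), here is a minimal Ruby
sketch of packing one fetched record into a single line. The helper name
encode_record and the field names "url", "status", "metadata" and "content"
are assumptions made for the example; they are not Nutch's own names, and
the content is assumed to be already decoded to UTF-8 text.

```ruby
require "json"

# Hypothetical helper: packs one fetched page and its fetch status into a
# single line of JSON. Field names are illustrative, not taken from Nutch.
def encode_record(url, fetch_status, content, metadata = {})
  JSON.generate({
    "url" => url,
    "status" => fetch_status,
    "metadata" => metadata,
    # scrub replaces invalid byte sequences so JSON generation does not fail
    "content" => content.to_s.scrub
  })
end

# Newlines inside the HTML are escaped by the JSON encoder, so each record
# occupies exactly one physical line of output.
puts encode_record("http://example.com/", "success", "<p>one</p>\n<p>two</p>\n")
```

Because the encoder escapes embedded newlines, a record written this way
stays on one line, which is what a line-based consumer needs.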

Lincoln Ritter wrote:
> Chris,
>
> We've been considering doing something similar.  Since our development
> environment is primarily ruby and we don't have a ton of java
> expertise in our shop we're looking for a way to give our ruby people
> a low-friction way of processing our crawl data.
>
> I've mainly been thinking about a JRuby solution.  My concern is
> performance, but if it's not too bad then I'll take the trade off so
> that our devs have a nicer time working with data.
>
> I'm pretty new to Nutch/Hadoop, but I'd love to collaborate on any
> solution that makes using ruby and Hadoop/Nutch easier.
>
> -lincoln
>
> --
> lincolnritter.com

Re: Streaming.jar for Nutch?

Posted by Lincoln Ritter <li...@lincolnritter.com>.
Chris,

We've been considering doing something similar.  Since our development
environment is primarily ruby and we don't have a ton of java
expertise in our shop we're looking for a way to give our ruby people
a low-friction way of processing our crawl data.

I've mainly been thinking about a JRuby solution.  My concern is
performance, but if it's not too bad then I'll take the trade off so
that our devs have a nicer time working with data.

I'm pretty new to Nutch/Hadoop, but I'd love to collaborate on any
solution that makes using ruby and Hadoop/Nutch easier.

-lincoln

--
lincolnritter.com



On Wed, Jun 11, 2008 at 3:06 PM, David Grandinetti <db...@gmail.com> wrote:
> Chris,
>
> I'm not sure I understand completely, but I would try to write a parser
> plugin that pipes content to an external ruby process...or even just use
> JRuby. This way would keep you from having to worry about the complexities
> of interacting with Hadoop directly.
>
> What kind of ruby parsing are you looking to do? I had considered doing the
> same thing to parse and sanitize news feeds.
>
> -dave
>
> --
> david grandinetti
> ideas for food and code

Re: Streaming.jar for Nutch?

Posted by Chris Anderson <jc...@grabb.it>.
On Wed, Jun 11, 2008 at 3:06 PM, David Grandinetti <db...@gmail.com> wrote:
> Chris,
>
> I'm not sure I understand completely, but I would try to write a parser
> plugin that pipes content to an external ruby process...or even just use
> JRuby. This way would keep you from having to worry about the complexities
> of interacting with Hadoop directly.

Well, I have lots of reasons to want to run additional Hadoop jobs
on the data, and I've found so far that interacting with Hadoop is
easier than interacting with Nutch, for the most part. So I'd like to
be able to process fetched documents using streaming.jar or other
generic Hadoop jobs.

It looks like our team is about to have a solution to the problem, so
hopefully we'll be able to post something soon. The nice thing about
streaming.jar is that it works with any process that reads its input
on stdin and writes its results to stdout. So Ruby is strictly an
afterthought. :)

Our solution will likely use JSON as the line protocol for map and
reduce scripts to handle; well-formed JSON is a great container for
fetched HTML and a little associated metadata.
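
To make the line-protocol idea concrete, here is a minimal sketch of what a
Ruby map script could look like under that scheme. It is not our team's
code. The record layout (one JSON object per line with "url" and "content"
keys), the function names map_line and run_mapper, and the choice to emit
the page title are all assumptions for illustration. It uses a plain regex
rather than Hpricot so that it runs with the standard library alone.

```ruby
require "json"

# Turns one input line (a JSON object with hypothetical "url" and "content"
# keys) into a "key<TAB>value" output line, or nil if the line is unusable.
def map_line(line)
  record = JSON.parse(line)
  return nil unless record.is_a?(Hash)

  html = record["content"].to_s
  # Crude title extraction; a real parser would replace this regex.
  title = html[/<title[^>]*>(.*?)<\/title>/im, 1].to_s.strip
  value = JSON.generate({ "title" => title, "bytes" => html.bytesize })
  "#{record["url"]}\t#{value}"
rescue JSON::ParserError
  nil # skip malformed lines instead of failing the whole task
end

# Reads records from `input` line by line and writes mapped lines to `output`.
def run_mapper(input, output)
  input.each_line do |line|
    result = map_line(line.chomp)
    output.puts(result) if result
  end
end

# When used as a streaming map script, the file would end with:
# run_mapper($stdin, $stdout)
```

Keeping the per-line logic in map_line means it can be tested without Hadoop
by feeding it strings directly; the value is emitted as JSON again so a
reduce script can parse it the same way.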

>
> What kind of ruby parsing are you looking to do? I had considered doing the
> same thing to parse and sanitize news feeds.
>

The Ruby parsers all use Hpricot.XML, and are wrapped really tightly
around our domain. I do recommend using Hpricot for all your parsing
needs. It's fast and not prone to (many) surprises.


-- 
Chris Anderson
http://jchris.mfdz.com

Re: Streaming.jar for Nutch?

Posted by David Grandinetti <db...@gmail.com>.
Chris,

I'm not sure I understand completely, but I would try to write a  
parser plugin that pipes content to an external ruby process...or even  
just use JRuby. This way would keep you from having to worry about the  
complexities of interacting with Hadoop directly.
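
A rough sketch of the piping pattern only, under stated assumptions: a real
Nutch parser plugin would be written in Java, and no such plugin appears in
this thread. The example below shows, in Ruby, a parent process handing one
document to a child process on stdin and reading the result from stdout.
The helper pipe_to_parser and the DEMO_PARSER command (an inline one-liner
that upcases its input) are hypothetical stand-ins for a real parser script.

```ruby
require "open3"
require "rbconfig"

# Hypothetical helper: sends `content` to the child process's stdin and
# returns whatever the child writes to stdout.
def pipe_to_parser(command, content)
  output, status = Open3.capture2(*command, stdin_data: content)
  raise "parser exited with status #{status.exitstatus}" unless status.success?
  output
end

# Stand-in for a real parser script: reads all of stdin and upcases it.
DEMO_PARSER = [RbConfig.ruby, "-e", "print $stdin.read.upcase"].freeze

puts pipe_to_parser(DEMO_PARSER, "<p>hello</p>")
```

Starting one child process per document is simple but pays the process
start-up cost every time; a long-lived child fed one record per line would
avoid that.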

What kind of ruby parsing are you looking to do? I had considered  
doing the same thing to parse and sanitize news feeds.

-dave

--
david grandinetti
ideas for food and code


On Jun 11, 2008, at 16:46, "Chris Anderson" <jc...@gmail.com> wrote:

> We're planning to run some Ruby parsers on the fetched content from a
> Nutch crawl. It seems like the best way to do this would be through an
> interface like Hadoop's streaming.jar, but streaming.jar expects a
> line-based input format.
>
> Has anyone written a version of streaming.jar for Nutch? We're working
> on one, so if you'd like to collaborate (or have any advice), please
> reply!
>
> Thanks,
> Chris
>
> --
> Chris Anderson
> http://jchris.mfdz.com

Streaming.jar for Nutch?

Posted by Chris Anderson <jc...@gmail.com>.
We're planning to run some Ruby parsers on the fetched content from a
Nutch crawl. It seems like the best way to do this would be through an
interface like Hadoop's streaming.jar, but streaming.jar expects a
line-based input format.

Has anyone written a version of streaming.jar for Nutch? We're working
on one, so if you'd like to collaborate (or have any advice), please
reply!

Thanks,
Chris

--
Chris Anderson
http://jchris.mfdz.com


