Posted to user@nutch.apache.org by Lincoln Ritter <li...@lincolnritter.com> on 2008/07/19 01:21:05 UTC

Re: Streaming.jar for Nutch?

Hi again.

Has there been any progress on this front?  I'm itching for a clean
way to use ruby for our analysis tasks.  I've been attempting to use
JRuby for this but without much success thus far.  Our primary concern
right now is parsing pages fetched with nutch into an intermediate
representation (likely in JSON) and then running various analysis
tasks over the generated JSON.
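For what it's worth, here is a minimal sketch of the kind of analysis pass I have in mind. The record layout (one JSON object per line with "url" and "content" fields) is just an assumption on my part; our format isn't settled yet.

```ruby
require 'json'

# One fetched page per line, serialized as JSON -- this record layout
# ({"url": ..., "content": ...}) is hypothetical, not a Nutch format.
def count_outlinks(io)
  io.each_line.sum do |line|
    record = JSON.parse(line)
    # Crude anchor-tag count over the raw markup; a real pass would
    # use a proper HTML parser.
    record.fetch('content', '').scan(/<a\s/i).size
  end
end
```

The nice thing is that once the crawl data is in line-delimited JSON, passes like this are trivial to write and to chain together.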

By way of background, my thinking on the JRuby front has been to
create a Java "harness" for each m/r task that then dispatches to
ruby code to perform the required work.  This is turning out to be
quite cumbersome.
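In case it helps anyone picture it, the ruby side of that harness is roughly shaped like this. The class and method names are my own convention, not anything Nutch or Hadoop provides; the Java side would instantiate the class through JRuby's embedding API and call #map once per record.

```ruby
require 'json'
require 'uri'

# Hypothetical ruby counterpart to a Java m/r "harness": the harness
# hands each record to #map and collects the emitted key/value pairs.
class PageMapper
  # key:   the record key passed through by the harness (unused here)
  # value: a JSON-serialized page record
  # Returns an array of [key, value] pairs, mimicking a Hadoop mapper.
  def map(key, value)
    record = JSON.parse(value)
    host = URI.parse(record['url']).host || 'unknown'
    [[host, 1]]
  end
end
```

Keeping the ruby side to a plain class like this at least means our devs never touch the Hadoop APIs directly; all the Writable/JobConf plumbing stays on the Java side.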

Thanks for your thoughts and solutions!

-lincoln

--
lincolnritter.com



On Wed, Jun 11, 2008 at 3:37 PM, Michael Gottesman <go...@reed.edu> wrote:
> I am working on a solution at the moment that involves writing out the data
> in a json format during the actual fetch, in the Fetcher phase. This is
> the only time when the CrawlDatum/Content data are together, so it makes
> sense if you are only interested in the actual website content and the
> metadata/fetch status (provided by the CrawlDatum).
>
> I will post it in a pastie if you are interested later tonight.
>
> Lincoln Ritter wrote:
>>
>> Chris,
>>
>> We've been considering doing something similar.  Since our development
>> environment is primarily ruby and we don't have a ton of java
>> expertise in our shop we're looking for a way to give our ruby people
>> a low-friction way of processing our crawl data.
>>
>> I've mainly been thinking about a JRuby solution.  My concern is
>> performance, but if it's not too bad then I'll take the trade off so
>> that our devs have a nicer time working with data.
>>
>> I'm pretty new to Nutch/Hadoop, but I'd love to collaborate on any
>> solution that makes using ruby and Hadoop/Nutch easier.
>>
>> -lincoln
>>
>> --
>> lincolnritter.com
>>
>>
>>
>> On Wed, Jun 11, 2008 at 3:06 PM, David Grandinetti <db...@gmail.com>
>> wrote:
>>
>>>
>>> Chris,
>>>
>>> I'm not sure I understand completely, but I would try to write a parser
>>> plugin that pipes content to an external ruby process...or even just use
>>> JRuby. That way you wouldn't have to worry about the complexities of
>>> interacting with Hadoop directly.
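(A stdin/stdout contract for that external-process approach might look roughly like the sketch below. The one-document-per-line framing and the output fields are assumptions; the plugin and the script just have to agree on them.)

```ruby
require 'json'

# Hypothetical external parser process: the plugin writes one raw
# document per line on stdin, and this script answers with one JSON
# record per line on stdout.
def filter(input, output)
  input.each_line do |line|
    content = line.chomp
    output.puts JSON.generate(
      'length' => content.bytesize,
      'title'  => content[%r{<title>(.*?)</title>}im, 1]
    )
  end
end

filter($stdin, $stdout) if __FILE__ == $PROGRAM_NAME
```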
>>>
>>> What kind of ruby parsing are you looking to do? I had considered doing
>>> the same thing to parse and sanitize news feeds.
>>>
>>> -dave
>>>
>>> --
>>> david grandinetti
>>> ideas for food and code
>>>
>>>
>>> On Jun 11, 2008, at 16:46, "Chris Anderson" <jc...@gmail.com> wrote:
>>>
>>>
>>>>
>>>> We're planning to run some Ruby parsers on the fetched content from a
>>>> Nutch crawl. It seems like the best way to do this would be through an
>>>> interface like Hadoop's streaming.jar, but streaming.jar expects a
>>>> line-based input format.
>>>>
>>>> Has anyone written a version of streaming.jar for Nutch? We're working
>>>> on one, so if you'd like to collaborate (or have any advice), please
>>>> reply!
>>>>
>>>> Thanks,
>>>> Chris
>>>>
>>>> --
>>>> Chris Anderson
>>>> http://jchris.mfdz.com
>>>>
>>>>
>>
>>
>