Posted to dev@nutch.apache.org by Srikanth Shankara Rao <sr...@aditi.com> on 2014/05/05 15:02:22 UTC

Post process Nutch data

Hi All,

I have crawled data using Nutch 1.8, and the data is in HDFS. I would like to post-process this data before indexing it into Solr. The idea is to transform the data based on the content and add a few additional fields that describe the content.

I would like to do this as part of a Hadoop job. What would be the best place to add this code?

Thanks
Srikanth
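
To make the request above concrete, here is a minimal sketch of the kind of per-record transformation being described: take a crawled page's URL and text, and derive a few extra fields from the content. It is only an illustration under stated assumptions. It does not use any Nutch, Hadoop, or Behemoth API, and the names `enrich`, `map_records`, `word_count`, and `mentions_hadoop` are hypothetical. In a real job, logic like this would presumably sit inside the map step that reads the crawled records.

```python
import re
from typing import Dict, Iterable, Iterator, Tuple


def enrich(url: str, text: str) -> Dict[str, object]:
    """Return the original fields plus a few derived ones.

    The derived fields (word_count, mentions_hadoop) are hypothetical
    examples of "additional fields that describe the content".
    """
    words = re.findall(r"[A-Za-z0-9']+", text)
    return {
        "url": url,
        "content": text,
        "word_count": len(words),
        "mentions_hadoop": "hadoop" in text.lower(),
    }


def map_records(
    records: Iterable[Tuple[str, str]]
) -> Iterator[Tuple[str, Dict[str, object]]]:
    """Map-style step: emit (url, enriched fields) for each (url, text) pair."""
    for url, text in records:
        yield url, enrich(url, text)


if __name__ == "__main__":
    # Made-up sample input; real input would come from the crawl data.
    sample = [("http://example.com/a", "Nutch runs on Hadoop.")]
    for url, fields in map_records(sample):
        print(url, fields)
```

Keeping the enrichment as a pure function of one record means it can be tested locally before it is wired into whatever job reads the crawled data.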

RE: Post process Nutch data

Posted by Srikanth Shankara Rao <sr...@aditi.com>.

Thanks, Julien. This helps. I’ll look into this.

From: Julien Nioche [mailto:lists.digitalpebble@gmail.com]
Sent: Monday, May 05, 2014 8:57 PM
To: dev@nutch.apache.org
Subject: Re: Post process Nutch data

Hi

As mentioned earlier in a different discussion on this list, Behemoth would be the right tool for this.

Julien

--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

Re: Post process Nutch data

Posted by Julien Nioche <li...@gmail.com>.

Hi

As mentioned earlier in a different discussion on this list, Behemoth would
be the right tool for this.

Julien

On Monday, 5 May 2014, Srikanth Shankara Rao <sr...@aditi.com> wrote:

>
> Hi All,
>
> I have crawled data using Nutch 1.8, and the data is in HDFS. I would like to
> post-process this data before indexing it into Solr. The idea is to transform
> the data based on the content and add a few additional fields that describe
> the content.
>
> I would like to do this as part of a Hadoop job. What would be the best
> place to add this code?
>
> Thanks
> Srikanth
>


-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble