Posted to dev@nutch.apache.org by Srikanth Shankara Rao <sr...@aditi.com> on 2014/05/05 15:02:22 UTC
Post process Nutch data
Hi All,
I have crawled data using Nutch 1.8, and the data is in HDFS. I would like to post-process this data before indexing it into Solr. The idea is to transform the data based on the content and add a few additional fields that describe the content.
I would like to do this as part of a Hadoop job. What would be the best place to add the code?
Thanks
Srikanth
RE: Post process Nutch data
Posted by Srikanth Shankara Rao <sr...@aditi.com>.
Thanks Julien. This helps. I’ll look into this.
From: Julien Nioche [mailto:lists.digitalpebble@gmail.com]
Sent: Monday, May 05, 2014 8:57 PM
To: dev@nutch.apache.org
Subject: Re: Post process Nutch data
Hi
As mentioned earlier in a different discussion on this list, Behemoth would be the right tool for this.
Julien
--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
Re: Post process Nutch data
Posted by Julien Nioche <li...@gmail.com>.
Hi
As mentioned earlier in a different discussion on this list, Behemoth would
be the right tool for this.
Julien