Posted to user@nutch.apache.org by navinkumar <na...@gmail.com> on 2012/12/26 06:47:38 UTC

Extract data in nutch

Hi, I'm a newbie to Nutch. I have successfully installed and configured Nutch
to crawl sites, and I want to get the data from the crawl.
1. Is there any way to get the data programmatically?
2. What is the command to extract the data into plain text?



--
View this message in context: http://lucene.472066.n3.nabble.com/Extract-data-in-nutch-tp4029072.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Extract data in nutch

Posted by Tejas Patil <te...@gmail.com>.
Hi Navin,
Crawl the data using the crawl command [0]. After that, use the readseg
command [1], [2] to dump the segment contents to a text file.
You can easily automate this with a shell script, Python, or another
scripting language.

[0] : section 3.1 in http://wiki.apache.org/nutch/NutchTutorial
[1] : http://www.marco.bianchi.name/myPortal/using-the-binnutch-readseg-command.aspx
[2] : http://wiki.apache.org/nutch/bin/nutch_readseg
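
For what it's worth, the two steps above could be automated roughly like this
in Python. This is only a sketch under some assumptions: a Nutch 1.x install,
the script being run from the Nutch directory so that bin/nutch resolves, and
the directory names "urls", "crawl" and "dump", which are just placeholders.
The function names are made up for illustration. The -no* options ask readseg
to skip everything except the parsed text; drop them if you want the raw
content and metadata as well.

```python
import subprocess
from pathlib import Path

# Assumption: the script is started from the Nutch install directory.
NUTCH = "bin/nutch"


def crawl_command(seed_dir, crawl_dir, depth=3, top_n=5):
    """Build the one-step crawl command (Nutch 1.x style, as in the tutorial)."""
    return [NUTCH, "crawl", str(seed_dir), "-dir", str(crawl_dir),
            "-depth", str(depth), "-topN", str(top_n)]


def readseg_dump_command(segment_dir, out_dir):
    """Build a readseg command that dumps only the parsed text of one segment."""
    return [NUTCH, "readseg", "-dump", str(segment_dir), str(out_dir),
            "-nocontent", "-nofetch", "-nogenerate", "-noparse", "-noparsedata"]


def dump_all_segments(crawl_dir, out_root):
    """Run readseg on every segment directory under <crawl_dir>/segments."""
    out_dirs = []
    for segment in sorted(Path(crawl_dir, "segments").iterdir()):
        if segment.is_dir():
            out_dir = Path(out_root, segment.name)
            # check=True raises CalledProcessError if nutch exits with an error.
            subprocess.run(readseg_dump_command(segment, out_dir), check=True)
            out_dirs.append(out_dir)
    return out_dirs
```

To use it, you would run subprocess.run(crawl_command("urls", "crawl"),
check=True) once, and then dump_all_segments("crawl", "dump"). Each output
directory should then hold a plain-text file named "dump" that you can read
like any other text file.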

Thanks,
Tejas Patil

