Posted to dev@nutch.apache.org by Jorge Conejero Jarque <jc...@gpm.es> on 2008/05/27 13:46:54 UTC

Crawler Data

I would like to build an application with the Nutch API that extracts data from pages before they are indexed, so that I can modify or process that data in some way; I think this could turn into something useful and interesting.

The problem is that I cannot find information on how to use the crawl part of the Nutch API.

I have only found tutorials that are run from the console with Cygwin, and they only cover configuration and the creation of an index.

I would appreciate any help, ideally with some examples.
Thanks.

Best regards.

Jorge Conejero Jarque
Dpto. Java Technology Group
GPM Factoría Internet
923100300
http://www.gpm.es 

 

Re: Crawler Data

Posted by kranthi reddy <kr...@gmail.com>.
Hi,
If you are trying to extract data from web pages, then you need to work on the
"parse-html" code. In the "src/plugin" directory you can also work on other
formats, such as PDF and MS Excel.
You need to work at the parsing stage because pages are parsed before they are
indexed. So if you parse them in a different way and extract the data you need,
you can index the pages using that extracted data.
Bye,
kranthi
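To make the idea concrete, here is a rough stand-in for the kind of extraction logic you would put inside the parse step. It is not real Nutch plugin code (the plugin interfaces differ between Nutch versions, and `PageDataExtractor` is a hypothetical name); it only sketches, in plain Java, what "parse the page differently and keep the data you need" might look like:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for extraction logic inside a Nutch parse plugin:
// given the raw HTML of a fetched page, pull out the piece you want to
// index instead of (or in addition to) the full page text.
public class PageDataExtractor {

    private static final Pattern TITLE =
        Pattern.compile("<title>(.*?)</title>",
                        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the contents of the <title> element, or "" if none is found.
    public static String extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : "";
    }

    public static void main(String[] args) {
        String html = "<html><head><title> Crawler Data </title></head>"
                    + "<body>Hello</body></html>";
        System.out.println(extractTitle(html)); // prints "Crawler Data"
    }
}
```

In a real plugin this logic would run on each page between fetching and indexing, so the index is built from the extracted data rather than the raw text.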

On Tue, May 27, 2008 at 5:16 PM, Jorge Conejero Jarque <jc...@gpm.es>
wrote:


Re: Crawler Data

Posted by Chris Anderson <jc...@grabb.it>.
I'm in a similar position. I'd like to be able to run arbitrary Hadoop
jobs across the pages saved by Nutch. This should be simple enough,
but I haven't found any direct documentation on how to do it yet.

Thanks in advance for any pointers.

Chris
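For illustration only, here is a plain-Java stand-in for the map step such a Hadoop job might perform. It is not real Hadoop code (a real job would read the (url, parsed text) records Nutch writes into its segment directories, and `SegmentWordCount` is a hypothetical name); it just shows the per-page processing you would drop into a mapper:

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java stand-in for the map step of a Hadoop job over Nutch data.
// In a real job the input would be the (url, parsed text) records from a
// Nutch segment; here they are simply a Map from URL to page text.
public class SegmentWordCount {

    // Counts word occurrences across the parsed text of all pages.
    public static Map<String, Integer> countWords(Map<String, String> pages) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String text : pages.values()) {
            for (String word : text.toLowerCase().split("\\W+")) {
                if (word.length() == 0) continue;
                Integer c = counts.get(word);
                counts.put(word, c == null ? 1 : c + 1);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, String> pages = new HashMap<String, String>();
        pages.put("http://example.com/a", "nutch crawls the web");
        pages.put("http://example.com/b", "the web is big");
        System.out.println(countWords(pages).get("web")); // prints 2
    }
}
```

In an actual Hadoop job the loop body would become the mapper, with the framework supplying one page per call and aggregating the counts in a reducer.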

On Tue, May 27, 2008 at 4:46 AM, Jorge Conejero Jarque <jc...@gpm.es> wrote:



-- 
Chris Anderson
http://jchris.mfdz.com