Posted to user@nutch.apache.org by eyal edri <ey...@gmail.com> on 2007/10/26 12:40:54 UTC

Is there a way to tell nutch fetcher not to parse for text in the page? (i.e. just links)

Hi,

Is there a way to tell Nutch not to parse the pages it fetches, and just
extract the links from them?
I know there is a "-noParsing" option, but I still need to download some
content types using the parse-XXX plugins, so I'm not sure it will work if I
use that option.

Thank you,

-- 
Eyal Edri

Re: Is there a way to tell nutch fetcher not to parse for text in the page? (i.e. just links)

Posted by Dennis Kubes <ku...@apache.org>.
The noparsing option will still download and store the content.  It 
simply will not parse the content.
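
In command-line terms the split looks roughly like this (a sketch only,
assuming a 0.9-style layout; the segment path below is made up):

  bin/nutch fetch crawl/segments/20071026123456 -noParsing   # download and store raw content only
  bin/nutch parse crawl/segments/20071026123456              # run the parse step separately later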

Dennis Kubes

eyal edri wrote:
> Hi,
> 
> Is there a way to tell Nutch not to parse the pages it fetches, and just
> extract the links from them?
> I know there is a "-noParsing" option, but I still need to download some
> content types using the parse-XXX plugins, so I'm not sure it will work if I
> use that option.
> 
> Thank you,
> 

Re: Is there a way to tell nutch fetcher not to parse for text in the page? (i.e. just links)

Posted by "joel.gump" <bi...@gmail.com>.
Maybe you can try HTML::LinkExtractor from CPAN:

http://search.cpan.org/~podmaster/HTML-LinkExtractor-0.13

eyal edri wrote:
> Hi,
>
> Is there a way to tell Nutch not to parse the pages it fetches, and just
> extract the links from them?
> I know there is a "-noParsing" option, but I still need to download some
> content types using the parse-XXX plugins, so I'm not sure it will work if I
> use that option.
>
> Thank you,
>
>   


Re: Is there a way to tell nutch fetcher not to parse for text in the page? (i.e. just links)

Posted by eyal edri <ey...@gmail.com>.
I understand,
but what I was referring to is the parsing of the text and the collection of
keywords for indexing/querying, with the overall intention of increasing the
speed of the fetcher and updatedb.

Is there a way to skip that part (maybe by removing several plugins?)
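
Something along these lines is what I have in mind (just a sketch, assuming
the stock conf/ layout of a 0.9-style install):

  # see which plugins are currently enabled
  grep -A 2 'plugin.includes' conf/nutch-default.xml conf/nutch-site.xml
  # ...and then override plugin.includes in conf/nutch-site.xml, dropping the
  # indexing/query-related entries and keeping only the parse-XXX plugins I need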


On 10/26/07, Andrzej Bialecki <ab...@getopt.org> wrote:
>
> eyal edri wrote:
> > Hi,
> >
> > Is there a way to tell Nutch not to parse the pages it fetches, and just
> > extract the links from them?
>
> Extracting links requires that a page is downloaded first (otherwise
> where would you extract the links from?) and parsed (otherwise how would
> you extract links from an unintelligible byte[]?).
>
>
> > I know there is a "-noParsing" option, but I still need to download some
> > content types using the parse-XXX plugins, so I'm not sure it will work if I
> > use that option.
>
> No download -> no parsing -> no outlinks.
>
>
> --
> Best regards,
> Andrzej Bialecki     <><
> ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>


-- 
Eyal Edri

Re: Is there a way to tell nutch fetcher not to parse for text in the page? (i.e. just links)

Posted by Andrzej Bialecki <ab...@getopt.org>.
eyal edri wrote:
> Hi,
> 
> Is there a way to tell Nutch not to parse the pages it fetches, and just
> extract the links from them?

Extracting links requires that a page is downloaded first (otherwise 
where would you extract the links from?) and parsed (otherwise how would 
you extract links from an unintelligible byte[]?).


> I know there is a "-noParsing" option, but I still need to download some
> content types using the parse-XXX plugins, so I'm not sure it will work if I
> use that option.

No download -> no parsing -> no outlinks.
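
In pipeline terms it looks roughly like this (a sketch, with made-up paths):
the outlinks only exist after the parse step, and only then can updatedb add
them to the crawldb.

  bin/nutch fetch crawl/segments/20071026123456 -noParsing        # download only
  bin/nutch parse crawl/segments/20071026123456                   # outlinks are produced here
  bin/nutch updatedb crawl/crawldb crawl/segments/20071026123456  # outlinks go into the crawldb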


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com