Posted to user@nutch.apache.org by eyal edri <ey...@gmail.com> on 2007/10/26 12:40:54 UTC
Is there a way to tell nutch fetcher not to parse for text in the page? (i.e. just links)
Hi,
Is there a way to tell Nutch not to parse the pages it fetches, i.e. just
to extract the links from them?
I know there is a "-noParsing" option, but I still need to download some
content types using the parse-XXX plugins, so I'm not sure it will work if I
use that option.
Thank you,
--
Eyal Edri
Re: Is there a way to tell nutch fetcher not to parse for text in the page? (i.e. just links)
Posted by Dennis Kubes <ku...@apache.org>.
The -noParsing option will still download and store the content; it
simply will not parse it.
Dennis Kubes
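A sketch of the two-phase workflow Dennis describes — fetch without parsing, then parse the stored content in a later pass. The segment path is hypothetical, and the exact flags are as they appeared in Nutch 0.x-era tools; check `bin/nutch` usage output for your version:

```shell
# Fetch a segment without parsing; raw content is still downloaded and stored
bin/nutch fetch crawl/segments/20071026123456 -noParsing

# Parse the stored content in a separate pass, when convenient
bin/nutch parse crawl/segments/20071026123456

# Update the crawldb with the newly discovered outlinks
bin/nutch updatedb crawl/crawldb crawl/segments/20071026123456
```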
Re: Is there a way to tell nutch fetcher not to parse for text in the page? (i.e. just links)
Posted by "joel.gump" <bi...@gmail.com>.
maybe you can try to use
http://search.capan.org/~podmaster/HTML-LinkExtractor-0.13
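joel's pointer is to a Perl module (HTML::LinkExtractor). The same idea can be sketched standalone in Java, Nutch's own language — note this is a naive regex-based illustration, not Nutch's actual parse-html plugin, which uses a real HTML parser and is more robust:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkSketch {
    // Naive href extractor; a real HTML parser handles malformed markup far better.
    private static final Pattern HREF =
        Pattern.compile("<a\\s+[^>]*href\\s*=\\s*[\"']([^\"']+)[\"']",
                        Pattern.CASE_INSENSITIVE);

    // Returns every href value found in the given HTML, in document order.
    public static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        String page = "<html><body>"
            + "<a href=\"http://example.com/a\">A</a>"
            + "<A HREF='http://example.com/b'>B</A>"
            + "</body></html>";
        // Prints the two extracted hrefs
        System.out.println(extractLinks(page));
    }
}
```

Of course, as Andrzej points out below in the thread, any such extraction still requires the page to have been downloaded first.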
Re: Is there a way to tell nutch fetcher not to parse for text in the page? (i.e. just links)
Posted by eyal edri <ey...@gmail.com>.
I understand,
but my intention was to parse the text and collect keywords for
indexing/querying, with the overall aim of speeding up the fetcher and
updatedb.
Is there a way to do that (maybe by removing several plugins)?
--
Eyal Edri
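One way to pursue eyal's plugin-removal idea is to restrict `plugin.includes` in `conf/nutch-site.xml` so that only the parse plugins actually needed are loaded. A sketch — the property name is real, but the exact value shown here is an assumption modeled on the Nutch 0.x `nutch-default.xml` default, and should be adapted to the content types you crawl:

```xml
<!-- conf/nutch-site.xml: load only the plugins you actually need -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
</property>
```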
Re: Is there a way to tell nutch fetcher not to parse for text in the page? (i.e. just links)
Posted by Andrzej Bialecki <ab...@getopt.org>.
eyal edri wrote:
> Hi,
>
> Is there a way to tell nutch not to parse the pages it fetches? meaning just
> to extract the links from it?
Extracting links requires that a page is downloaded first (otherwise
where would you extract the links from?) and parsed (otherwise how would
you extract links from an unintelligible byte[]?).
> I know there is a "-no parsing" attribute,but still i need to d/l some
> contentTypes using the parse-XXX plugins.. so i'm not sure it will work if i
> use the option.
No download -> no parsing -> no outlinks.
--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com