Posted to user@nutch.apache.org by mina <ta...@gmail.com> on 2011/09/24 17:43:13 UTC

how can i crawl pdfs?

Hi all,
When I crawl PDFs, Nutch fetches every link it finds inside them.
How can I turn this off?
Thanks a lot.

--
View this message in context: http://lucene.472066.n3.nabble.com/how-can-i-crawl-pdfs-tp3364548p3364548.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: how can i crawl pdfs?

Posted by Markus Jelsma <ma...@openindex.io>.

Right now you cannot turn this feature off, as it is baked into TikaParser. If no
outlinks are detected, the parser will use OutlinkExtractor to find plain-text
URLs and collect those as outlinks.

You can open an issue for support of an option to toggle the outlink 
extractor.
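
To illustrate the behaviour described above, and what such a toggle might
look like, here is a rough standalone sketch. It is not Nutch's code: the
class name, the method names, the fallbackEnabled flag and the simplified URL
regex are all made up for this example, and the real OutlinkExtractor uses
its own, more thorough expression.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the "fall back to plain-text URLs" logic with a toggle.
public class OutlinkFallbackSketch {

    // Simplified URL pattern, for illustration only (an assumption, not Nutch's regex).
    private static final Pattern URL_PATTERN =
        Pattern.compile("https?://[^\\s<>\"']+");

    // Stand-in for a plain-text extractor: collects every URL-like string in the text.
    public static List<String> extractPlainTextUrls(String text) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL_PATTERN.matcher(text);
        while (m.find()) {
            urls.add(m.group());
        }
        return urls;
    }

    // Returns the links the parser found; only if there are none AND the
    // (hypothetical) toggle is on does it scan the plain text for URLs.
    public static List<String> collectOutlinks(List<String> parsedLinks,
                                               String text,
                                               boolean fallbackEnabled) {
        if (!parsedLinks.isEmpty() || !fallbackEnabled) {
            return parsedLinks;
        }
        return extractPlainTextUrls(text);
    }

    public static void main(String[] args) {
        String pdfText = "See http://example.com/report.pdf for details.";
        List<String> noParsedLinks = new ArrayList<>();

        // Fallback on: the URL in the text is collected as an outlink.
        System.out.println(collectOutlinks(noParsedLinks, pdfText, true));
        // Fallback off: no outlinks are collected from the text.
        System.out.println(collectOutlinks(noParsedLinks, pdfText, false));
    }
}
```

With the flag off, a PDF that has no structured links yields no outlinks at
all, which is the behaviour asked for in the original question.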

> hi all,
> when i crawl pdfs ,nutch fetch any link in pdfs ,
> how can i omit this?
> thanks a lot.