Posted to user@nutch.apache.org by paddz <pa...@aufwind.cc> on 2013/01/10 13:18:40 UTC

Crawling PDFs

Hi there,

I am having some problems parsing PDFs. I have a website to crawl which
includes some links to PDF files. My problem is that Nutch does not
recognize these links as PDF files.

The links are just simple output links (http://XYZ/output/4366) with no
file extension, and this seems to be the problem. If I rebuild the links with
a .pdf extension, Nutch crawls them, but that is not really an option for
me.
Is there another solution, or do I just have an error in my config
elsewhere? I would have bet that Nutch can detect PDFs whether or not they
have a file extension.
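
For illustration only: detecting a PDF without relying on the URL is possible in principle, because PDF files begin with the signature %PDF-. The Python sketch below checks the first bytes of already-fetched content. The function name is hypothetical and this is not Nutch's own detection code, just the general idea of content-based detection.

```python
# Hypothetical helper, not part of Nutch: identify a PDF from its content.
PDF_SIGNATURE = b"%PDF-"


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the fetched bytes start with the PDF header signature."""
    return data.startswith(PDF_SIGNATURE)


if __name__ == "__main__":
    # The URL suffix plays no role here; only the bytes matter.
    print(looks_like_pdf(b"%PDF-1.4\n1 0 obj\n"))
    print(looks_like_pdf(b"<html><body>not a pdf</body></html>"))
```

A check like this only sees content that was actually fetched, so it does not help if the server refuses or alters the response before the body arrives.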





--
View this message in context: http://lucene.472066.n3.nabble.com/Crawling-PDFs-tp4032174.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Crawling PDFs

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Please try altering the http.accept property in nutch-default.xml to the
following:

<property>
  <name>http.accept</name>
  <value>text/html,application/xhtml+xml,application/xml,application/pdf;q=0.9,*/*;q=0.8</value>
  <description>Value of the "Accept" request header field.
  </description>
</property>
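
To make the meaning of that value easier to see, here is a small Python sketch that splits an Accept header into media types and their q (preference) weights. The parser is a simplified illustration with a hypothetical function name; it handles only the q parameter and is not how Nutch itself processes the setting.

```python
# Simplified, hypothetical Accept-header parser for illustration only.
ACCEPT = "text/html,application/xhtml+xml,application/xml,application/pdf;q=0.9,*/*;q=0.8"


def parse_accept(value: str) -> list[tuple[str, float]]:
    """Return (media_type, q) pairs in the order they appear in the header."""
    preferences = []
    for item in value.split(","):
        parts = [part.strip() for part in item.split(";")]
        media_type = parts[0]
        quality = 1.0  # q defaults to 1.0 when it is not given
        for param in parts[1:]:
            name, _, raw = param.partition("=")
            if name.strip() == "q":
                quality = float(raw)
        preferences.append((media_type, quality))
    return preferences


if __name__ == "__main__":
    for media_type, quality in parse_accept(ACCEPT):
        print(f"{media_type:25s} q={quality}")
```

Read this way, the value tells the server that application/pdf is explicitly acceptable (q=0.9), slightly below the HTML and XML types and above the catch-all */* entry.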






-- 
*Lewis*

Re: Crawling PDFs

Posted by paddz <pa...@aufwind.cc>.
Thanks for your advice gora, it is being served.

Patrick




Re: Crawling PDFs

Posted by Gora Mohanty <go...@mimirtech.com>.
On 14 January 2013 16:12, paddz <pa...@aufwind.cc> wrote:
>
> Hi Lewis,
>
> I am using Nutch 1.5.1.
> I get no specific log output or errors.
>
> I am expecting Nutch to crawl PDFs with no file extension, e.g.
> /output/mypdffile, but in practice Nutch only crawls/parses PDFs whose URLs
> look like /output/mypdffile.pdf
[...]

Just a thought: Is your PDF content being served with
mimetype="application/pdf"?
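
One way to check this outside of Nutch is to look at the Content-Type response header directly. The following is a minimal Python sketch using only the standard library; the function names are hypothetical, the URL is the placeholder from this thread, and it assumes the server answers HEAD requests.

```python
import urllib.request
from typing import Optional


def is_pdf_content_type(header_value: Optional[str]) -> bool:
    """Return True if a Content-Type header value denotes application/pdf."""
    if not header_value:
        return False
    # Drop parameters such as "; charset=..." and compare case-insensitively.
    media_type = header_value.split(";", 1)[0].strip().lower()
    return media_type == "application/pdf"


def served_content_type(url: str, timeout: float = 10.0) -> Optional[str]:
    """Send a HEAD request and return the raw Content-Type header, if any."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.headers.get("Content-Type")


# Usage (requires network access; the URL is the placeholder from the thread):
#   header = served_content_type("http://XYZ/output/4366")
#   print(header, is_pdf_content_type(header))
```

If the header comes back as something other than application/pdf (for example text/html or application/octet-stream), that would point to the server side rather than to the crawler configuration.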

Regards,
Gora

Re: Crawling PDFs

Posted by paddz <pa...@aufwind.cc>.
Hi Lewis,

I am using Nutch 1.5.1.
I get no specific log output or errors.

I am expecting Nutch to crawl PDFs with no file extension, e.g.
/output/mypdffile, but in practice Nutch only crawls/parses PDFs whose URLs
look like /output/mypdffile.pdf

readdb stats:
Statistics for CrawlDb: XYZ
TOTAL urls:	104
retry 0:	104
min score:	0.0
avg score:	0.037596155
max score:	1.01
status 2 (db_fetched):	104
CrawlDb statistics: done

Thanks
Patrick





Re: Crawling PDFs

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi paddz,
Which Nutch version are you using?
Is there any log output?
What are the readdb results?
What is the difference between the expected and the actual behaviour?

Thanks
Lewis




-- 
*Lewis*