
Posted to user@nutch.apache.org by nutch_guy <ad...@bluewin.ch> on 2011/01/10 14:16:47 UTC

RE: Crawling PDF documents


Hi 

Thanks for your answer.
My problem still exists: I can crawl PDF documents,
but there are a lot of PDF documents which are
not supported.

Thanks for your help.

nutch_guy

-- 
View this message in context: http://lucene.472066.n3.nabble.com/Crawling-PDF-documents-tp1173626p2226962.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Crawling PDF documents

Posted by Julien Nioche <li...@gmail.com>.

>  Generally, I found that Tika was at that stage not as mature as some of
>  the pre-Tika plugins, and used whatever worked best for a document type.

While this was true for the HTML parser and the PDF extraction in 0.8
(which should be fixed in the next version of Tika), I am not otherwise
aware of Tika being worse than the legacy plugins.

In most cases Tika wraps the same underlying libraries as the old plugins
and simply exposes them through a unified API. The versions used by Tika
are usually more up to date than those in the old plugins, so they are
likely to be more efficient.

Of course, you can always help make Tika better (and the underlying
libraries it uses) by reporting the issues you come across on its JIRA
(https://issues.apache.org/jira/browse/TIKA).

Julien

-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

RE: Crawling PDF documents

Posted by Ar...@csiro.au.

Hi Julien,

>-----Original Message-----
>From: Julien Nioche [mailto:lists.digitalpebble@gmail.com]
>Sent: Tuesday, January 11, 2011 8:38 PM
>To: user@nutch.apache.org
>Subject: Re: Crawling PDF documents
>
>Hi Arkadi,
>
>The latest release of Tika (0.8) does indeed have some issues with PDF, but the
>version bundled with Nutch 1.2 (Tika 0.7) should be fine. Did you have any specific issues?

Sorry, it was a few months ago. I can't tell now. Generally, I found that Tika was at that stage not as mature as some of the pre-Tika plugins, and used whatever worked best for a document type.
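
As a rough illustration of "using whatever worked best for a document type", the idea can be sketched as an ordered list of parsers per MIME type, tried in turn until one succeeds. This is only a minimal sketch: the class ParserChain, the Parser interface and the stub parsers "primary" and "fallback" are hypothetical names made up for this example, and none of this is Nutch's or Tika's actual API.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch: try the parsers registered for a MIME type in preference order. */
final class ParserChain {

    /** A parser returns the extracted text, or throws if it cannot handle the content. */
    interface Parser {
        String name();
        String parse(byte[] content) throws Exception;
    }

    private final Map<String, List<Parser>> byMimeType = new LinkedHashMap<>();

    /** Registers the parsers for one MIME type, most preferred first. */
    void register(String mimeType, List<Parser> inPreferenceOrder) {
        byMimeType.put(mimeType, List.copyOf(inPreferenceOrder));
    }

    /** Returns the text from the first parser that succeeds, or empty if none does. */
    Optional<String> parse(String mimeType, byte[] content) {
        for (Parser p : byMimeType.getOrDefault(mimeType, List.of())) {
            try {
                String text = p.parse(content);
                if (text != null) {
                    return Optional.of(text);
                }
            } catch (Exception e) {
                // Log the failure and fall through to the next parser in the list.
                System.err.println(p.name() + " failed: " + e.getMessage());
            }
        }
        return Optional.empty();
    }

    /** Stub parser that always returns the given text (for demonstration only). */
    static Parser fixed(String name, String text) {
        return new Parser() {
            public String name() { return name; }
            public String parse(byte[] content) { return text; }
        };
    }

    /** Stub parser that always fails (for demonstration only). */
    static Parser failing(String name) {
        return new Parser() {
            public String name() { return name; }
            public String parse(byte[] content) throws Exception {
                throw new Exception("unsupported document");
            }
        };
    }

    public static void main(String[] args) {
        ParserChain chain = new ParserChain();
        chain.register("application/pdf",
                List.of(failing("primary"), fixed("fallback", "extracted text")));
        System.out.println(chain.parse("application/pdf", new byte[0]).orElse("<unparsed>"));
    }
}
```

In a real setup the stubs would be replaced by calls into whichever parsing libraries are actually installed; the point of the sketch is only the per-type ordering and the fall-through on failure.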

>Note that the parse-pdf plugin has been removed from the next release of
>Nutch (1.3)

Let's hope that Tika is more mature now. If not, I will have to stick with 1.2 for a while. We have a lot of PDF docs on our site and I want them indexed.

Regards,

Arkadi

 


Re: Crawling PDF documents

Posted by Julien Nioche <li...@gmail.com>.

Hi Arkadi,

The latest release of Tika (0.8) does indeed have some issues with PDF, but the
version bundled with Nutch 1.2 (Tika 0.7) should be fine. Did you have any specific issues?
Note that the parse-pdf plugin has been removed from the next release of
Nutch (1.3).

J.

On 11 January 2011 03:12, <Ar...@csiro.au> wrote:

> Hi,
>
> >-----Original Message-----
> >From: nutch_guy [mailto:adrian.stadelmann@bluewin.ch]
> >Sent: Tuesday, January 11, 2011 12:17 AM
> >To: nutch-user@lucene.apache.org
> >Subject: RE: Crawling PDF documents
> >
> >
> >
> >Hi
> >
> >Thanks for your answer.
> >My problem still exists: I can crawl PDF documents,
> >but there are a lot of PDF documents which are
> >not supported.
>
> I also had this problem. The parse-pdf plugin uses old pdf libraries. Many
> of the problems will go away if you upgrade it to the new libraries and use
> it (not Tika!) to parse pdf. You can do it yourself or get upgraded sources
> from here:
>
> http://www.atnf.csiro.au/computing/software/arch/
>
> Regards,
>
> Arkadi
>
> >
> >Thanks for your help.
> >
> >nutch_guy
> >
> >--
> >View this message in context:
> >http://lucene.472066.n3.nabble.com/Crawling-PDF-documents-tp1173626p2226962.html
> >Sent from the Nutch - User mailing list archive at Nabble.com.
>



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

RE: Crawling PDF documents

Posted by Ar...@csiro.au.

Hi,

>-----Original Message-----
>From: nutch_guy [mailto:adrian.stadelmann@bluewin.ch]
>Sent: Tuesday, January 11, 2011 12:17 AM
>To: nutch-user@lucene.apache.org
>Subject: RE: Crawling PDF documents
>
>
>
>Hi
>
>Thanks for your answer.
>My problem still exists: I can crawl PDF documents,
>but there are a lot of PDF documents which are
>not supported.

I also had this problem. The parse-pdf plugin uses old PDF libraries. Many of the problems will go away if you upgrade the plugin to newer libraries and use it (not Tika!) to parse PDF. You can do the upgrade yourself or get upgraded sources from here:

http://www.atnf.csiro.au/computing/software/arch/

Regards,

Arkadi

>
>Thanks for your help.
>
>nutch_guy
>
>--
>View this message in context:
>http://lucene.472066.n3.nabble.com/Crawling-PDF-documents-tp1173626p2226962.html
>Sent from the Nutch - User mailing list archive at Nabble.com.