
Posted to user@nutch.apache.org by Godmar Back <go...@gmail.com> on 2010/01/07 07:16:26 UTC

alternatives to PDFBox (was: IOException when parsing PDF files)

ps: upon closer examination, it seems that PDFBox is not very mature
software; I was able to fix its parser to go past this first error I
encountered, then discovered that it's not implementing many essential PDF
operators. As a result, the extracted text is pretty bad and one cannot
expect good results from indexing.

What experience have others had with that?  It seems properly indexing PDF
files is crucial for any serious application of nutch.

What would be required if I wanted to use Poppler's 'pdftotext' instead?
Would that require writing a new plug-in in place of PdfParser.java, or does
nutch perhaps support a simpler way that would allow me to pipe the .pdf
file through pdftotext and then treat it as a .txt file (while still offering
the cached .pdf in the search results)?

I note that others are dissatisfied with PDFBox as well. [1]

 - Godmar

[1] http://groups.google.com/group/xtf-user/msg/efae13d6cf878691

Re: alternatives to PDFBox (was: IOException when parsing PDF files)

Posted by Godmar Back <go...@gmail.com>.

Thanks Andrzej, I'll check out parse-ext.

To reply to your views on PDFBox, etc. - I understand text extraction is a
hard problem to solve, because it essentially involves reversing a
type-setting procedure, although PDFBox's failure in this case is actually
not related to those inherent difficulties (from what I can tell so far;
I'm working with them on fixing it). It's simply that PDFBox, apparently,
hadn't seen PDFs generated by this tool.

I note that there's a comment in PdfParser by John Xing where he voices
skepticism, in 2004, about how well it'll work.

And yes, my corpus is unlucky in this sense, but you'll always find that with
any software, users base their opinion and involvement on the corpus they
care about. It doesn't help to point out that it works on other people's
data really well ;-)

 - Godmar

Re: alternatives to PDFBox (was: IOException when parsing PDF files)

Posted by Andrzej Bialecki <ab...@getopt.org>.

On 2010-01-07 07:16, Godmar Back wrote:
> ps: upon closer examination, it seems that PDFBox is not very mature
> software; I was able to fix its parser to go past this first error I
> encountered, then discovered that it's not implementing many essential PDF
> operators.

That's surprising. I've been using PDFBox for many years and it
usually worked reasonably well, although it was always slow and memory
hungry. If you have a list of these essential missing operators, please
let the PDFBox people know.

> As a result, the extracted text is pretty bad and one cannot
> expect good results from indexing.

In my experience (I've been indexing many PDFs too) it depends on the
application that produced the PDFs ... For a large class of PDFs it
works reasonably well, although it is resource-hungry (much less so with 
the latest trunk of PDFBox), but there's certainly a sizeable class of 
docs where it fails miserably. Other tools fail too, just not for the 
same documents, and often do it silently (e.g. skipping a problematic 
chunk, or producing garbled output). This is true of poppler, and of 
other (commercial) PDF tools. Text extraction from PDF is just a 
difficult problem to solve, for many reasons, among them the fact that 
many PDFs in the wild don't conform to the PDF spec and instead rely on 
idiosyncrasies of Acrobat Reader to parse properly.

>
> What experience have others had with that?  It seems properly indexing PDF
> files is crucial for any serious application of nutch.

It depends on your goals, again ;) Most users of Nutch are concerned 
with typical Web resources, where HTML is still the predominant format, 
so for them a few problematic PDFs are not a big deal ... You are just 
unlucky that your corpus consists of mostly PDFs, as it seems.

Also, content parsing in Nutch is not really being developed in Nutch - 
we use libraries developed by other projects, so in cases like this you 
should report the problems to the respective project (PDFBox in this 
case). This way everybody will benefit, because we (the Nutch project) 
can't really fix the issue in PDFBox.

>
> What would be required if I wanted to use Poppler's 'pdftotext' instead?
> Would that require writing a new plug-in in place of PdfParser.java, or does
> nutch perhaps support a simpler way that would allow me to pipe the .pdf
> file through pdftotext and then treat it as a .txt file (while still offering
> the cached .pdf in the search results)?

Please take a look at the parse-ext plugin - it's there precisely for 
this purpose.
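
As a rough illustration of the kind of external command such a setup could
call, here is a minimal Python sketch that spools PDF bytes to a temporary
file and runs Poppler's pdftotext on it, collecting the text from stdout.
The function names (pdftotext_command, extract_text) are hypothetical, and
the idea that the wrapper is handed the raw PDF bytes and must return plain
text is an assumption about how it would be wired up, so check the
parse-ext configuration in your Nutch version before relying on it. It also
assumes a Unix-like system with pdftotext on the PATH.

```python
import subprocess
import tempfile


def pdftotext_command(pdf_path, layout=False):
    """Build the argv for Poppler's pdftotext; '-' sends the text to stdout."""
    cmd = ["pdftotext", "-enc", "UTF-8"]
    if layout:
        cmd.append("-layout")  # try to preserve the physical page layout
    return cmd + [str(pdf_path), "-"]


def extract_text(pdf_bytes, timeout=30):
    """Write the PDF to a temporary file and return pdftotext's output.

    Hypothetical helper: raises CalledProcessError if pdftotext fails and
    TimeoutExpired if it runs longer than `timeout` seconds.
    """
    with tempfile.NamedTemporaryFile(suffix=".pdf") as tmp:
        tmp.write(pdf_bytes)
        tmp.flush()  # make sure pdftotext sees the complete file
        result = subprocess.run(
            pdftotext_command(tmp.name),
            capture_output=True,
            timeout=timeout,
            check=True,
        )
    return result.stdout.decode("utf-8", errors="replace")
```

A wrapper script could then read the document from standard input and print
extract_text(sys.stdin.buffer.read()). Spooling to a temporary file avoids
depending on whether the installed pdftotext can read a PDF from stdin.
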

> I note that others are dissatisfied with PDFBox as well. [1]
> [1] http://groups.google.com/group/xtf-user/msg/efae13d6cf878691

Umm .. if anything that comment suggests that properly handling diverse 
PDFs is simply a hard thing to do, and PDFBox is not that much to blame.

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com