Posted to dev@tika.apache.org by Michael Schmitz <mi...@schmitztech.com> on 2010/12/10 20:00:28 UTC

Fwd: Tika Snapshot Fails on PDF Articles

Hi,

I don't think the current snapshot is parsing articles (PDFs with
columns/beads) correctly.  The text is not in the right order, as it
intermixes text from different beads.  Try it on an academic paper.

http://turing.cs.washington.edu/papers/acl08.pdf

Tika App 0.8 parses the text in the right order but omits spaces.  PDFBox
1.3.1 parses the file wonderfully.  I attached the output of parsing the
PDF with each utility.

Peace.  Michael

Re: Tika Snapshot Fails on PDF Articles

Posted by Staffan <so...@gmail.com>.
2010/12/10 Michael Schmitz <mi...@schmitztech.com>:
> Hi,
>
> I don't think the current snapshot is parsing articles (PDFs with
> columns/beads) correctly.  The text is not in the right order, as it
> intermixes text from different beads.  Try it on an academic paper.
>
> http://turing.cs.washington.edu/papers/acl08.pdf
>
> Tika App 0.8 parses the text in the right order but omits spaces.  PDFBox
> 1.3.1 parses the file wonderfully.  I attached the output of parsing the
> PDF with each utility.
>
> Peace.  Michael
>
>

Could be related to https://issues.apache.org/jira/browse/TIKA-548