Posted to dev@tika.apache.org by "Staffan Olsson (JIRA)" <ji...@apache.org> on 2010/11/11 20:41:13 UTC
[jira] Updated: (TIKA-548) PDF content extracted as single line
[ https://issues.apache.org/jira/browse/TIKA-548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Staffan Olsson updated TIKA-548:
--------------------------------
Attachment: tika-PDF-content-regression-test.patch
> PDF content extracted as single line
> ------------------------------------
>
> Key: TIKA-548
> URL: https://issues.apache.org/jira/browse/TIKA-548
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 0.8
> Reporter: Staffan Olsson
> Attachments: tika-PDF-content-regression-test.patch
>
>
> Rev 1029510 introduces a regression in PDF content parsing that is now present in the 0.8 RC: paragraphs from the PDF are no longer separated by newlines. This is a problem both for reading and for indexing. See the attached test.
> Note that PDFBox 1.3.1 appears to extract the text correctly, at least from the command line. Here is its output for a sample file with a headline followed by a one-word paragraph:
> $> java -jar pdfbox-app-1.3.1.jar ExtractText -console docs/shortpdf.pdf
> 1 - untitled 3 - 2010-02-13 09:52 - Staffan Olsson
> PDF Title For Short Document
> veryshortpdfcontents
> But Tika prints:
> $> java -jar tika-app-0.9-20101110.175016-3.jar docs/shortpdf.pdf
> ...
> <p>1 - untitled 3 - 2010-02-13 09:52 - Staffan OlssonPDF
> Title For Short Documentveryshortpdfcontents</p>
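
The difference between the two outputs above can be checked mechanically. What follows is a minimal sketch, not the attached patch: it assumes the extracted text is already available as a plain string (markup such as <p> stripped), and it tests whether each expected paragraph sits on a line of its own. The class name ParagraphSeparationCheck and the method eachOnOwnLine are hypothetical and are not part of Tika or PDFBox.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper for illustration only; not part of Tika or PDFBox.
public class ParagraphSeparationCheck {

    // Returns true if every expected paragraph appears as a complete line
    // of the extracted text (leading/trailing whitespace ignored).
    static boolean eachOnOwnLine(String extracted, List<String> expected) {
        List<String> lines = Arrays.stream(extracted.split("\\R"))
                .map(String::trim)
                .collect(Collectors.toList());
        return lines.containsAll(expected);
    }

    public static void main(String[] args) {
        // Paragraphs of the sample file, taken from the report above.
        List<String> expected = Arrays.asList(
                "1 - untitled 3 - 2010-02-13 09:52 - Staffan Olsson",
                "PDF Title For Short Document",
                "veryshortpdfcontents");

        // Text as printed by PDFBox 1.3.1 in the report.
        String pdfboxText = "1 - untitled 3 - 2010-02-13 09:52 - Staffan Olsson\n"
                + "PDF Title For Short Document\n"
                + "veryshortpdfcontents\n";

        // Text content of the <p> element printed by Tika in the report.
        String tikaText = "1 - untitled 3 - 2010-02-13 09:52 - Staffan OlssonPDF\n"
                + "Title For Short Documentveryshortpdfcontents";

        System.out.println("PDFBox output separated: " + eachOnOwnLine(pdfboxText, expected));
        System.out.println("Tika output separated:   " + eachOnOwnLine(tikaText, expected));
    }
}
```

With the two strings quoted from the report, the check passes for the PDFBox output and fails for the Tika output, since the paragraphs run together there. A real regression test would obtain the string from the parser under test instead of hard-coding it.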