Posted to dev@tika.apache.org by "Staffan Olsson (JIRA)" <ji...@apache.org> on 2010/11/11 20:39:15 UTC

[jira] Created: (TIKA-548) PDF content extracted as single line

PDF content extracted as single line
------------------------------------

                 Key: TIKA-548
                 URL: https://issues.apache.org/jira/browse/TIKA-548
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 0.8
            Reporter: Staffan Olsson


Rev 1029510 introduces a regression in PDF content parsing, now present in 0.8 RC. Paragraphs from the PDF are no longer separated by newlines. This is a problem for both reading and indexing. See the attached test.

Note that PDFBox 1.3.1 seems to extract the text correctly, at least from the command line. Here is the output for a sample file with a headline followed by a one-word paragraph:
$> java -jar pdfbox-app-1.3.1.jar ExtractText -console docs/shortpdf.pdf
1   -    untitled 3   -    2010-02-13 09:52   -    Staffan Olsson
PDF Title For Short Document
veryshortpdfcontents

But Tika prints:
$> java -jar tika-app-0.9-20101110.175016-3.jar docs/shortpdf.pdf
...
<p>1   -    untitled 3   -    2010-02-13 09:52   -    Staffan OlssonPDF
Title For Short Documentveryshortpdfcontents</p>
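
To illustrate why the missing separators matter for indexing, here is a minimal sketch in plain Java. It assumes an indexer that simply splits text on whitespace; the class name FusedTokenDemo and the helper tokens() are hypothetical and not part of Tika or PDFBox. The two strings are shortened from the outputs shown above.

```java
import java.util.Arrays;
import java.util.List;

class FusedTokenDemo {
    // Split on runs of whitespace, roughly what a simple whitespace tokenizer does.
    static List<String> tokens(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        // Shortened from the PDFBox output: one paragraph per line.
        String expected = "Staffan Olsson\nPDF Title For Short Document\nveryshortpdfcontents";
        // Shortened from the Tika output: paragraph breaks are gone.
        String actual = "Staffan OlssonPDF\nTitle For Short Documentveryshortpdfcontents";

        System.out.println(tokens(expected));
        System.out.println(tokens(actual));
        // The last paragraph can no longer be found as a word of its own.
        System.out.println(tokens(actual).contains("veryshortpdfcontents"));
    }
}
```

With the fused text, "Olsson" and "PDF" become the single token "OlssonPDF", and "veryshortpdfcontents" is only present as part of "Documentveryshortpdfcontents", so an exact-term search for it would miss the document.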


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (TIKA-548) PDF content extracted as single line

Posted by "Reinhard Schwab (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TIKA-548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reinhard Schwab updated TIKA-548:
---------------------------------

    Attachment: test.pdf

This is a sample PDF document to reproduce the regression.


[jira] Commented: (TIKA-548) PDF content extracted as single line

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12933706#action_12933706 ] 

Chris A. Mattmann commented on TIKA-548:
----------------------------------------

+1 to a patch release. Jukka, let me know if we need one...


[jira] Commented: (TIKA-548) PDF content extracted as single line

Posted by "Paul Pearcy (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12975864#action_12975864 ] 

Paul Pearcy commented on TIKA-548:
----------------------------------

+1 for a 0.8.1 release, unless 0.9 is imminent.

Thanks!


[jira] Commented: (TIKA-548) PDF content extracted as single line

Posted by "Paul Pearcy (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12983411#action_12983411 ] 

Paul Pearcy commented on TIKA-548:
----------------------------------

Just wanted to say that I don't believe a stable version of Tika is currently available because of this issue. It is fixed on trunk, but trunk has a file handle leak that prevents large-scale use of the fix:
https://issues.apache.org/jira/browse/TIKA-567

Thanks


[jira] Commented: (TIKA-548) PDF content extracted as single line

Posted by "Staffan Olsson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12933536#action_12933536 ] 

Staffan Olsson commented on TIKA-548:
-------------------------------------

Verified to work with Solr. Thanks for the fix.


[jira] Commented: (TIKA-548) PDF content extracted as single line

Posted by "Reinhard Schwab (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12964592#action_12964592 ] 

Reinhard Schwab commented on TIKA-548:
--------------------------------------

I generated this document with OpenOffice's PDF export.
A tab character is missing.




[jira] Resolved: (TIKA-548) PDF content extracted as single line

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TIKA-548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-548.
--------------------------------

       Resolution: Fixed
    Fix Version/s: 0.9
         Assignee: Jukka Zitting

Fixed in revision 1036562. We may want to do a 0.8.1 patch release with this and perhaps some other fixes.


[jira] Commented: (TIKA-548) PDF content extracted as single line

Posted by "Reinhard Schwab (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12964590#action_12964590 ] 

Reinhard Schwab commented on TIKA-548:
--------------------------------------

There is still a regression:
I am missing some whitespace when comparing today's trunk with an earlier Tika snapshot from August
and with the output of the PDF text stripper.
I cannot provide my sample PDF file, but maybe I will find another one.
I can only give an example line of text.

Tika 0.8 snapshot from August, PDF text stripper:
Familienstand: ledig

trunk:
Familienstand:ledig
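
One way to confirm that two extractions differ only in their word separators is to compare them with all whitespace removed. The following is a rough sketch in plain Java; the class SeparatorDiffDemo and its methods are hypothetical helpers, not part of Tika or PDFBox.

```java
class SeparatorDiffDemo {
    // Remove every run of whitespace so only the visible characters remain.
    static String stripWhitespace(String s) {
        return s.replaceAll("\\s+", "");
    }

    // True when the strings are not identical but match once whitespace is ignored,
    // i.e. the extracted characters are the same and only separators differ.
    static boolean differOnlyInWhitespace(String a, String b) {
        return !a.equals(b) && stripWhitespace(a).equals(stripWhitespace(b));
    }

    public static void main(String[] args) {
        String fromSnapshot = "Familienstand: ledig";
        String fromTrunk = "Familienstand:ledig";
        System.out.println(differOnlyInWhitespace(fromSnapshot, fromTrunk));
    }
}
```

A check like this could be run over whole extracted documents to find lines where only a separator was lost.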




[jira] Updated: (TIKA-548) PDF content extracted as single line

Posted by "Staffan Olsson (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TIKA-548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Staffan Olsson updated TIKA-548:
--------------------------------

    Attachment: tika-PDF-content-regression-test.patch


[jira] Commented: (TIKA-548) PDF content extracted as single line

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12966937#action_12966937 ] 

Jukka Zitting commented on TIKA-548:
------------------------------------

Good point, thanks! I fixed the problem with missing word separators in 1042338.
