Posted to dev@tika.apache.org by "Dennis Adler (JIRA)" <ji...@apache.org> on 2011/01/14 02:12:45 UTC
[jira] Created: (TIKA-583) Tika 0.8 line break removal is faulty
(misses space when concatenating lines) for PDF file
Tika 0.8 line break removal is faulty (misses space when concatenating lines) for PDF file
------------------------------------------------------------------------------------------
Key: TIKA-583
URL: https://issues.apache.org/jira/browse/TIKA-583
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 0.8
Environment: Win Pro 7, x64, jdk1.6.0_22, jre 6.0.220.4
Reporter: Dennis Adler
The attached PDF (a legal filing from the web), when parsed by Tika 0.7, has the following as its first several lines of plain text:
------- start ---------------
IN THE COURT OF APPEALS OF THE STATE OF WASHINGTON
DIVISION ONE
SERGEY SAVCHUK, )
) No. 64269-3-I
Appellant, )
v. )
) UNPUBLISHED OPINION
STEVEN G. JERDE and )
DARLYCE J. JERDE, husband and wife )
)
Respondents. )
_______________________________ ) FILED: November 1, 2010
--------------- end ---------------------
Tika 0.8 has this instead:
-------------- start ---------------------
IN THE COURT OF APPEALS OF THE STATE OF WASHINGTONDIVISION ONESERGEYSAVCHUK,))No. 64269-3-IAppellant,)v.))UNPUBLISHED OPINIONSTEVENG. JERDE and )DARLYCE J. JERDE, husband and wife))Respondents.)_______________________________ )FILED: November 1, 2010schindler, j
--------------- end ---------------------
Notice that as part of the improved paragraph breaking for PDF files, the "header" of the document had its lines concatenated without spaces in between, creating run-on words (e.g. "WASHINGTONDIVISION" and "ONESERGEYSAVCHUK"). See the original PDF for more details and compare it to the text.
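To make the reported symptom concrete, here is a minimal, standard-library-only sketch of the difference between joining lines naively and joining them with a space at each seam. This is not Tika's or PDFBox's actual code; the class name LineJoiner and its join method are hypothetical, chosen only for illustration. It covers only the line-seam case ("WASHINGTONDIVISION"); it does not model spaces lost inside a line, as in "SERGEYSAVCHUK".

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, for illustration only; not part of Tika or PDFBox.
final class LineJoiner {

    /**
     * Joins lines into one string, inserting a single space at each seam
     * unless whitespace is already present on either side of it.
     * Empty lines are skipped.
     */
    static String join(List<String> lines) {
        StringBuilder out = new StringBuilder();
        for (String line : lines) {
            if (line.isEmpty()) {
                continue;
            }
            boolean needsSpace = out.length() > 0
                    && !Character.isWhitespace(out.charAt(out.length() - 1))
                    && !Character.isWhitespace(line.charAt(0));
            if (needsSpace) {
                out.append(' ');
            }
            out.append(line);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<String> header = Arrays.asList(
                "IN THE COURT OF APPEALS OF THE STATE OF WASHINGTON",
                "DIVISION ONE",
                "SERGEY SAVCHUK, )");

        // Naive concatenation reproduces the run-on word at the seam.
        System.out.println(String.join("", header));

        // Seam-aware joining keeps the words apart.
        System.out.println(join(header));
    }
}
```

Running main prints the naive result first (containing "WASHINGTONDIVISION") and the seam-aware result second (containing "WASHINGTON DIVISION").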
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Resolved: (TIKA-583) Tika 0.8 line break removal is faulty
(misses space when concatenating lines) for PDF file
Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jukka Zitting resolved TIKA-583.
--------------------------------
Resolution: Duplicate
Assignee: Jukka Zitting
This is a duplicate of TIKA-548, fixed in trunk.
[jira] Updated: (TIKA-583) Tika 0.8 line break removal is faulty
(misses space when concatenating lines) for PDF file
Posted by "Dennis Adler (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dennis Adler updated TIKA-583:
------------------------------
Attachment: Savchuk v. Jerde.pdf
Original PDF; parsed with tika-app-0.7 and tika-app-0.8 (release). Sample text in the bug report from the "Plain text" tabs. Found this file on the web, so should be fine for ASF inclusion.
[jira] Commented: (TIKA-583) Tika 0.8 line break removal is faulty
(misses space when concatenating lines) for PDF file
Posted by "Ken Krugler (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12981822#action_12981822 ]
Ken Krugler commented on TIKA-583:
----------------------------------
Is this a PDFBox issue or a Tika issue? Any chance you could re-run it with Tika 0.8, but using the PDFBox jar from Tika 0.7?
[jira] Commented: (TIKA-583) Tika 0.8 line break removal is faulty
(misses space when concatenating lines) for PDF file
Posted by "Dennis Adler (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12983454#action_12983454 ]
Dennis Adler commented on TIKA-583:
-----------------------------------
Ken, I tried replacing the three PDFBox 1.3.1 JARs (fontbox, jempbox, pdfbox) in my classpath with the 1.1.0 versions from Tika 0.7. Every PDF I tested failed with a "null" error, so the old PDFBox code does not seem to work with Tika 0.8.
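When swapping JARs like this, it can help to confirm which files the JVM actually loaded the classes from. The following is a rough sketch using only the standard library; the class name ClasspathProbe and its locate method are hypothetical, and the probed class names are assumed to be present in the pdfbox, fontbox, jempbox and tika JARs respectively. It does not explain the "null" error; it only reports where each class came from.

```java
import java.security.CodeSource;

// Hypothetical diagnostic, for illustration only.
final class ClasspathProbe {

    /** Returns the location a class was loaded from, or a note if unavailable. */
    static String locate(String className) {
        try {
            // initialize=false: find the class without running its static initializers
            Class<?> c = Class.forName(className, false,
                    ClasspathProbe.class.getClassLoader());
            CodeSource src = c.getProtectionDomain().getCodeSource();
            return src == null
                    ? "(bootstrap or unknown source)"
                    : String.valueOf(src.getLocation());
        } catch (ClassNotFoundException | LinkageError e) {
            return "(not on classpath)";
        }
    }

    public static void main(String[] args) {
        // Assumed representative classes, one per JAR mentioned above.
        String[] names = {
            "org.apache.pdfbox.pdmodel.PDDocument",
            "org.apache.fontbox.cmap.CMap",
            "org.apache.jempbox.xmp.XMPMetadata",
            "org.apache.tika.Tika"
        };
        for (String name : names) {
            System.out.println(name + " -> " + locate(name));
        }
    }
}
```

Run with the same classpath as the failing setup, the printed locations show whether the 1.1.0 or 1.3.1 JARs were picked up, or whether a JAR is missing altogether.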