Posted to dev@tika.apache.org by "Dennis Adler (JIRA)" <ji...@apache.org> on 2011/01/14 02:12:45 UTC

[jira] Created: (TIKA-583) Tika 0.8 line break removal is faulty (misses space when concatenating lines) for PDF file

Tika 0.8 line break removal is faulty (misses space when concatenating lines) for PDF file
------------------------------------------------------------------------------------------

                 Key: TIKA-583
                 URL: https://issues.apache.org/jira/browse/TIKA-583
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 0.8
         Environment: Win Pro 7, x64, jdk1.6.0_22, jre 6.0.220.4 
            Reporter: Dennis Adler


The included PDF (a legal filing from the web) when parsed by Tika 0.7 has the following as its first several lines of plain text:
------- start ---------------
IN THE COURT OF APPEALS OF THE STATE OF WASHINGTON
DIVISION ONE
  SERGEY SAVCHUK, )
 ) No. 64269-3-I
 Appellant, )
 v. )
 ) UNPUBLISHED OPINION
 STEVEN G. JERDE and )
 DARLYCE J. JERDE, husband and wife )
)
 Respondents. )
 _______________________________  ) FILED: November 1, 2010
--------------- end ---------------------

Tika 0.8 has this instead:
-------------- start ---------------------
IN THE COURT OF APPEALS OF THE STATE OF WASHINGTONDIVISION ONESERGEYSAVCHUK,))No. 64269-3-IAppellant,)v.))UNPUBLISHED OPINIONSTEVENG. JERDE and )DARLYCE J. JERDE, husband and wife))Respondents.)_______________________________  )FILED: November 1, 2010schindler, j
--------------- end ---------------------

Notice that, as part of the improved paragraph breaking for PDF files, the lines of the document "header" were concatenated without spaces in between, creating run-on words (e.g. "WASHINGTONDIVISION" and "ONESERGEYSAVCHUK"). Compare the original PDF with the extracted text for more details.
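
To illustrate the expected behaviour, here is a minimal sketch of joining extracted lines so that a space is inserted at each line boundary unless whitespace is already there. This is not the Tika or PDFBox code; the class and method names (LineJoiner, joinLines) are hypothetical and used only for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Tika or PDFBox.
public class LineJoiner {

    /**
     * Joins lines into a single string. A single space is added at each
     * line boundary unless the text so far already ends with whitespace
     * or the next line already starts with whitespace.
     */
    public static String joinLines(List<String> lines) {
        StringBuilder out = new StringBuilder();
        for (String line : lines) {
            if (line.isEmpty()) {
                continue; // nothing to append for an empty line
            }
            if (out.length() > 0
                    && !Character.isWhitespace(out.charAt(out.length() - 1))
                    && !Character.isWhitespace(line.charAt(0))) {
                out.append(' ');
            }
            out.append(line);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<String> header = Arrays.asList(
                "IN THE COURT OF APPEALS OF THE STATE OF WASHINGTON",
                "DIVISION ONE");
        // prints "IN THE COURT OF APPEALS OF THE STATE OF WASHINGTON DIVISION ONE"
        System.out.println(joinLines(header));
    }
}
```

With this kind of joining, the first two header lines would come out as "WASHINGTON DIVISION ONE" rather than "WASHINGTONDIVISION ONE".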


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Resolved: (TIKA-583) Tika 0.8 line break removal is faulty (misses space when concatenating lines) for PDF file

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-583.
--------------------------------

    Resolution: Duplicate
      Assignee: Jukka Zitting

This is a duplicate of TIKA-548, fixed in trunk.



[jira] Updated: (TIKA-583) Tika 0.8 line break removal is faulty (misses space when concatenating lines) for PDF file

Posted by "Dennis Adler (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Adler updated TIKA-583:
------------------------------

    Attachment: Savchuk v. Jerde.pdf

Original PDF; parsed with tika-app-0.7 and tika-app-0.8 (release). The sample text in the bug report comes from the "Plain text" tabs. I found this file on the web, so it should be fine for ASF inclusion.



[jira] Commented: (TIKA-583) Tika 0.8 line break removal is faulty (misses space when concatenating lines) for PDF file

Posted by "Ken Krugler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12981822#action_12981822 ] 

Ken Krugler commented on TIKA-583:
----------------------------------

Is this a PDFBox issue or a Tika issue? Any chance you could re-run it with Tika 0.8, but using the PDFBox jar from Tika 0.7?



[jira] Commented: (TIKA-583) Tika 0.8 line break removal is faulty (misses space when concatenating lines) for PDF file

Posted by "Dennis Adler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12983454#action_12983454 ] 

Dennis Adler commented on TIKA-583:
-----------------------------------

Ken, I tried replacing the three PDFBox 1.3.1 JARs (fontbox, jempbox, pdfbox) in my classpath with the 1.1.0 versions from Tika 0.7. Every PDF I tested failed with a "null" error, so the old PDFBox code does not seem to work with Tika 0.8.
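
When swapping JARs like this, it can help to confirm which JAR a class is actually loaded from. The following is a minimal sketch using only the standard Java API; the class name JarLocator and the method describeSource are hypothetical, and the default class name checked in main is an assumption about which PDFBox class to probe.

```java
import java.security.CodeSource;

// Hypothetical diagnostic helper, not part of Tika or PDFBox.
public class JarLocator {

    /** Returns the location (typically a JAR URL) a class was loaded from. */
    public static String describeSource(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        if (source == null || source.getLocation() == null) {
            // Classes from the JDK itself usually have no code source.
            return "(bootstrap or unknown)";
        }
        return source.getLocation().toString();
    }

    public static void main(String[] args) {
        // Assumed default: a core PDFBox class; pass another name as args[0].
        String name = args.length > 0
                ? args[0]
                : "org.apache.pdfbox.pdmodel.PDDocument";
        try {
            System.out.println(name + " loaded from "
                    + describeSource(Class.forName(name)));
        } catch (ClassNotFoundException e) {
            System.out.println(name + " is not on the classpath");
        }
    }
}
```

Running it with the same classpath as Tika shows whether the 1.1.0 or the 1.3.1 JAR is the one being picked up.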
