Posted to dev@pdfbox.apache.org by "Thomas Fischer (JIRA)" <ji...@apache.org> on 2011/03/05 16:52:45 UTC

[jira] Created: (PDFBOX-970) TeX-created ligatures and umlauts are not recognised

TeX-created ligatures and umlauts are not recognised
----------------------------------------------------

                 Key: PDFBOX-970
                 URL: https://issues.apache.org/jira/browse/PDFBOX-970
             Project: PDFBox
          Issue Type: Bug
          Components: FontBox
    Affects Versions: 1.5.0
         Environment: Mac OS X 10.6.6, Java(TM) SE Runtime Environment (build 1.6.0_22-b04-307-10M3261)

            Reporter: Thomas Fischer


Ligatures in a TeX-created document are lost in v. 1.5, although v. 1.4 recognised them, e.g.
  1.4        1.5
  official   ocial
  effort     e ort
  fields     elds
  first      rst
In addition, German umlauts (ä, ö, ü) are represented as ( a,  o,  u).
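
For readers comparing the two outputs, here is a minimal sketch of what "recognised" ligatures look like as plain text. It assumes the extracted text contains Unicode ligature code points (U+FB00 to U+FB04) and expands them with the JDK's java.text.Normalizer. The class and method names are made up for illustration, and this is not PDFBox's own handling of ligatures.

```java
import java.text.Normalizer;

public class LigatureExpansion {
    // Hypothetical helper: expands ligature code points such as U+FB01 ("fi")
    // into their component letters via compatibility normalization (NFKC).
    static String expandLigatures(String extracted) {
        return Normalizer.normalize(extracted, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // U+FB03 is the "ffi" ligature, U+FB00 is "ff", U+FB01 is "fi".
        System.out.println(expandLigatures("o\uFB03cial")); // official
        System.out.println(expandLigatures("e\uFB00ort"));  // effort
        System.out.println(expandLigatures("\uFB01elds"));  // fields
    }
}
```

If the ligature glyph is dropped before this step, as in the 1.5 column above, there is nothing left to expand, so the loss has to be fixed during extraction itself.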

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

[jira] Updated: (PDFBOX-970) TeX-created ligatures and umlauts are not recognised

Posted by "Thomas Fischer (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Fischer updated PDFBOX-970:
----------------------------------

    Attachment: Test2.pdf
                Test2-1.6.txt
                Test2.1.4.txt

I put icu-4.0.1.jar on my classpath, and that essentially resolved the umlaut issue: they are now represented as combined characters (I'm not quite sure what search engines do with those). However, pdfbox 1.4 didn't need the additional ICU jar; was this requirement introduced in a recent version?
Unfortunately, there are still some strange problems with the conversion, essentially missing characters. I am uploading a new test file together with conversions made with pdfbox 1.4 and 1.6; comparing them shows the errors (and some additional differences).
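
On the search-engine question: if "combined characters" means a base letter followed by a combining mark (for example u + U+0308), one possible post-processing step is to compose such sequences into single code points. The sketch below uses java.text.Normalizer from the JDK. The class and method names are hypothetical, and nothing here is taken from PDFBox or ICU.

```java
import java.text.Normalizer;

public class UmlautComposition {
    // Hypothetical helper: composes base letter + combining mark sequences
    // into precomposed characters where Unicode defines one (NFC).
    static String compose(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String decomposed = "u\u0308ber";      // u + combining diaeresis + "ber"
        String composed = compose(decomposed); // U+00FC + "ber"
        System.out.println(decomposed.length() + " -> " + composed.length()); // 5 -> 4
        System.out.println(composed.equals("\u00FCber"));                     // true
    }
}
```

Both forms render identically, but a plain string comparison treats them as different, which is why normalizing before indexing or comparing can matter.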

> TeX-created ligatures and umlauts are not recognised
> ----------------------------------------------------
>
>                 Key: PDFBOX-970
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-970
>             Project: PDFBox
>          Issue Type: Bug
>          Components: FontBox
>    Affects Versions: 1.5.0
>         Environment: Mac OS X 10.6.6, Java(TM) SE Runtime Environment (build 1.6.0_22-b04-307-10M3261)
>            Reporter: Thomas Fischer
>              Labels: textExtraction
>         Attachments: A Python Library for Provenance Recording and Querying.txt, A Python Library for Provenance Recording and Querying.txt, Test.pdf, Test.pdf, Test2-1.6.txt, Test2.1.4.txt, Test2.pdf
>
>
> Ligatures in a TeX-created document are lost in v. 1.5, although v. 1.4 recognised them, e.g.
>   1.4        1.5
>   official   ocial
>   effort     e ort
>   fields     elds
>   first      rst
> In addition, German umlauts (ä, ö, ü) are represented as ( a,  o,  u).


[jira] Updated: (PDFBOX-970) TeX-created ligatures and umlauts are not recognised

Posted by "Thomas Fischer (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Fischer updated PDFBOX-970:
----------------------------------

    Attachment: A Python Library for Provenance Recording and Querying.txt
                A Python Library for Provenance Recording and Querying.txt

A PDF file and the respective text extractions with v. 1.4 and v. 1.5 from http://www.aero-grid.de/ergebnisse/publikationen/ipaw08-id43-bochner-gude-schreiber.pdf


[jira] Commented: (PDFBOX-970) TeX-created ligatures and umlauts are not recognised

Posted by "Andreas Lehmkühler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13003459#comment-13003459 ] 

Andreas Lehmkühler commented on PDFBOX-970:
-------------------------------------------

I can't confirm the umlaut issue. The latest snapshot works fine for me. Do you have the ICU jar on your classpath?

The position of the German quote seems to be misinterpreted. Because it is placed very low on the line, the algorithm presumes it has to be on the next line. This was already an issue with 1.4.0.
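
To illustrate how a low-placed glyph can be mistaken for the start of a new line, here is a deliberately simplified sketch. It assumes a rule that compares baselines against a fixed tolerance. The rule, the names, and all coordinates are hypothetical and are not taken from the PDFBox source.

```java
public class LineBreakHeuristic {
    // Hypothetical rule: a glyph starts a new line when its baseline differs
    // from the current line's baseline by more than the given tolerance.
    static boolean startsNewLine(float lineBaseline, float glyphBaseline, float tolerance) {
        return Math.abs(lineBaseline - glyphBaseline) > tolerance;
    }

    public static void main(String[] args) {
        float lineBaseline = 700f; // made-up page coordinate of the text line
        float tolerance = 2f;      // made-up threshold

        // An ordinary letter sits on the baseline and stays on the line.
        System.out.println(startsNewLine(lineBaseline, 700f, tolerance)); // false

        // A quote glyph drawn well below the baseline exceeds the tolerance
        // and is treated as belonging to a different line.
        System.out.println(startsNewLine(lineBaseline, 694f, tolerance)); // true
    }
}
```

Under such a rule, the glyph after the lowered quote returns to the original baseline and triggers a second break, which would match a quote surrounded by line breaks.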

I guess the JIRA error occurred because of some maintenance (the infra guys just upgraded JIRA to 4.2.4).


[jira] Commented: (PDFBOX-970) TeX-created ligatures and umlauts are not recognised

Posted by "Andreas Lehmkühler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13003189#comment-13003189 ] 

Andreas Lehmkühler commented on PDFBOX-970:
-------------------------------------------

I solved the issue in revision 1078518. However, I can only confirm that it works for ligatures, as your example doesn't contain any German umlauts. Can you provide us with another example, or can you confirm that this solution also works for that kind of PDF?


[jira] Updated: (PDFBOX-970) TeX-created ligatures and umlauts are not recognised

Posted by "Thomas Fischer (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Fischer updated PDFBOX-970:
----------------------------------

    Attachment: Test.pdf

I downloaded and built revision 1078518 (pdfbox-1.6.0-SNAPSHOT.jar with fontbox and jempbox). While the ligatures seem to be OK, the umlauts are not: ü is represented as u¨ etc. (a spacing ¨, not a combining one). Furthermore, '„', the opening German quote, is represented as '\n”\n' (a line break before and after a closing German quote). I am trying to attach a test file, Test.pdf (I didn't succeed yesterday; where do I report JIRA errors?).
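
As a stopgap on the consumer side, output of the kind described above (a letter followed by the spacing diaeresis U+00A8) could be repaired after extraction. The following is a minimal sketch using only JDK classes. The class and method names are hypothetical, and it is a workaround under the stated assumption, not the fix applied in PDFBox.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

public class SpacingDiaeresisRepair {
    // Matches any letter immediately followed by the spacing diaeresis U+00A8.
    private static final Pattern LETTER_THEN_DIAERESIS =
            Pattern.compile("(\\p{L})\u00A8");

    // Hypothetical helper: swaps the spacing diaeresis for the combining one
    // (U+0308), then composes the pair into a precomposed letter where possible.
    static String repair(String text) {
        String withCombining = LETTER_THEN_DIAERESIS.matcher(text).replaceAll("$1\u0308");
        return Normalizer.normalize(withCombining, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        System.out.println(repair("u\u00A8ber").equals("\u00FCber")); // true
        // A diaeresis that does not follow a letter is left untouched.
        System.out.println(repair("\u00A8").equals("\u00A8"));        // true
    }
}
```

Normalization alone is not enough here, because U+00A8 is a separate spacing character that does not combine with the preceding letter; hence the explicit replacement before composing.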


[jira] Updated: (PDFBOX-970) TeX-created ligatures and umlauts are not recognised

Posted by "Thomas Fischer (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Fischer updated PDFBOX-970:
----------------------------------

    Attachment: Test.pdf
