Posted to dev@pdfbox.apache.org by "Thomas Fischer (JIRA)" <ji...@apache.org> on 2010/05/16 13:40:42 UTC

[jira] Created: (PDFBOX-728) Text extracted from a TeX-created PDF file comes in some form of hex encoding

Text extracted from a TeX-created PDF file comes in some form of hex encoding
-----------------------------------------------------------------------------

                 Key: PDFBOX-728
                 URL: https://issues.apache.org/jira/browse/PDFBOX-728
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 1.1.0
         Environment: Mac OS X 10.6.3, using org.apache.pdfbox.ExtractText -encoding UTF-8
            Reporter: Thomas Fischer


The text in this example is extracted essentially correctly, but it is presented in a hex-encoded form, probably interspersed with some non-encoded characters, as in the following example:

x54x6f x69x6ex63x6fx72x70x6fx72x61x74x65 x74x68x65 x65x6cx61x73x74x69x63 x70x72x6fx70x65x72x74x69x65x73 x6fx66 x74x68x65 x6dx61x74x65x72x69x61x6cx2c x77x65 x6ex65x65x64 x74x6f x69x6ex74x72x6fx64x75x63x65 x74x68x65 x64x65x66x6fx72x2d
x6dx61x74x69x6fx6e x74x65x6ex73x6fx72
F(X, t) = ∂x∂X (X, t).
A Perl command like
s/x([\da-f]{2})/chr(hex($1))/eg;
will usually reveal a correct translation, although certain characters may be off; for example, I had to add
s/ÿ/ß/g;
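
For readers without Perl, the substitution quoted above can be sketched in Python. This only illustrates the reporter's workaround on already-extracted text; it is not what PDFBox does internally, and the name decode_hex_runs is hypothetical.

```python
import re

# Matches one "xNN" token, where NN is two lowercase hex digits,
# mirroring the Perl pattern x([\da-f]{2}) quoted above.
HEX_TOKEN = re.compile(r"x([0-9a-f]{2})")

def decode_hex_runs(text: str) -> str:
    """Replace every xNN token with the character whose code is 0xNN."""
    return HEX_TOKEN.sub(lambda m: chr(int(m.group(1), 16)), text)

sample = "x54x6f x69x6ex63x6fx72x70x6fx72x61x74x65 x74x68x65"
print(decode_hex_runs(sample))  # prints "To incorporate the"
```

Like the Perl version, this is a blunt instrument: an ordinary word containing "x" followed by two hex digits (the "xce" in "exceed", for instance) would be rewritten as well, so it is only safe on output that is known to be hex-encoded.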
 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PDFBOX-728) Text extracted from a TeX-created PDF file comes in some form of hex encoding

Posted by "Andreas Lehmkühler (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12870078#action_12870078 ] 

Andreas Lehmkühler commented on PDFBOX-728:
-------------------------------------------

With revision 947097 I've added some more mappings:

- ligatures
- the special characters mentioned by Thomas

and I've improved the decoding of the character encoding used by TeX/LaTeX:

- the range from 0-9 is mapped to 160-169
- the range from 10-31 is mapped to 173-..
- 127 is mapped to 196
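
The remapping listed above can be sketched as a small Python function. This is an illustration under stated assumptions, not the PDFBox code: the name remap_tex_code is hypothetical, and the second range is assumed to be shifted by a constant offset (10 -> 173, 11 -> 174, ...), since the comment gives only where that range starts.

```python
def remap_tex_code(code: int) -> int:
    """Remap a TeX font character code following the list above."""
    if 0 <= code <= 9:
        return code + 160   # 0-9 -> 160-169
    if 10 <= code <= 31:
        return code + 163   # 10-31 -> 173 onwards (assumed contiguous)
    if code == 127:
        return 196          # 127 -> 196
    return code             # every other code is left unchanged

for code in (0, 9, 10, 127, 65):
    print(code, "->", remap_tex_code(code))
```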



[jira] Resolved: (PDFBOX-728) Text extracted from a TeX-created PDF file comes in some form of hex encoding

Posted by "Andreas Lehmkühler (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PDFBOX-728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler resolved PDFBOX-728.
---------------------------------------

    Fix Version/s: 1.2.0
       Resolution: Fixed

This issue seems to have been resolved by the fix for PDFBOX-534. I've attached the text extracted with the fix applied.


[jira] Updated: (PDFBOX-728) Text extracted from a TeX-created PDF file comes in some form of hex encoding

Posted by "Thomas Fischer (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PDFBOX-728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Fischer updated PDFBOX-728:
----------------------------------

    Attachment: wias_preprints_1401.pdf
                wias_preprints_1401.txt

A PDF file showing this behaviour and its translation into text using
org.apache.pdfbox.ExtractText -encoding UTF-8


[jira] Updated: (PDFBOX-728) Text extracted from a TeX-created PDF file comes in some form of hex encoding

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PDFBOX-728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting updated PDFBOX-728:
---------------------------------

    Fix Version/s:     (was: 1.2.0)


[jira] Assigned: (PDFBOX-728) Text extracted from a TeX-created PDF file comes in some form of hex encoding

Posted by "Andreas Lehmkühler (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PDFBOX-728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler reassigned PDFBOX-728:
-----------------------------------------

    Assignee: Andreas Lehmkühler


[jira] Updated: (PDFBOX-728) Text extracted from a TeX-created PDF file comes in some form of hex encoding

Posted by "Thomas Fischer (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PDFBOX-728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Fischer updated PDFBOX-728:
----------------------------------

    Attachment: wias_preprints_1401-v1.2.0SNAPSHOT.txt

I have compiled pdfbox-1.2.0-SNAPSHOT.jar and extracted the text from wias_preprints_1401.pdf.
The result shows a significant improvement over the previous version.

The only major error remaining is that the en dash (character code 21, or \x15, in the original) is mapped to character 184 (\xB8, a cedilla) instead of 8211 (U+2013): the output shows (1.1)¸(1.5) instead of (1.1)–(1.5).

A minor error is
H1 ↪→ Lp instead of H1 ↪ Lp. This is due to the TeX construction used:
arrowhookleft + arrow is supposed to create a left hook and combine it with an arrow. I don't know whether Unicode has such a construction, so I suggested using "RIGHTWARDS ARROW WITH HOOK" for the *combination* of the two characters.
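
As a rough illustration of that suggestion, the two-character sequence could be collapsed in a post-processing step like the Python sketch below. It assumes the hook glyph is extracted as U+21AA and the arrow as U+2192; the function name merge_hook_arrow is hypothetical, and this is a workaround on the extracted text, not a change to PDFBox itself.

```python
import unicodedata

HOOK_ARROW = "\u21aa"   # the hook glyph as it appears in the extracted text
RIGHT_ARROW = "\u2192"  # the plain arrow that follows it

def merge_hook_arrow(text: str) -> str:
    """Collapse hook arrow + arrow into a single hook arrow."""
    return text.replace(HOOK_ARROW + RIGHT_ARROW, HOOK_ARROW)

print(merge_hook_arrow("H1 \u21aa\u2192 Lp"))
print(unicodedata.name(HOOK_ARROW))  # prints "RIGHTWARDS ARROW WITH HOOK"
```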

But I will have to test how these adjustments work on TeX files created in other contexts.
(BTW, the erroneous ß in "we can somewhat relax the ßmallnessrequirement" is also contained in the original PDF, so this is not an error on the side of PDFBox.)


[jira] Updated: (PDFBOX-728) Text extracted from a TeX-created PDF file comes in some form of hex encoding

Posted by "Thomas Fischer (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PDFBOX-728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Fischer updated PDFBOX-728:
----------------------------------

    Priority: Minor  (was: Major)

While the major problems appear to be fixed, the attached file wias_preprints_1401_r944875.txt still shows some minor mistakes. I hope these can be solved on a general level and are not specific to this particular file (I found this behaviour in all the hex-encoded files).

The following additional replacements are required (Perl notation). Note that the characters to be replaced in the ASCII range 1-31 are usually not printable:
	s/\x15/-/g; 	#	 = 21 = x15
	s/\x1b/ff/g; 	#	 = 27 = x1b
	s/\x1c/fi/g;	#	 = 28 = x1c
	s/\x1d/fl/g;	#	 = 29 = x1d
	s/\x1e/ffi/g;	#	 = 30 = x1e
	s/\x8a/Ł/g;	#	 = 138 = x8a
	s/\xff/ß/g;		#	ÿ = 255 = xff
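
The same table can be written as a Python translation map. This is a minimal sketch that assumes the extracted text has been read so that each character's code point equals the byte value used in the Perl patterns; the names REPLACEMENTS and apply_replacements are hypothetical.

```python
# Replacement table copied from the Perl substitutions above.
REPLACEMENTS = {
    0x15: "-",     # 21
    0x1B: "ff",    # 27
    0x1C: "fi",    # 28
    0x1D: "fl",    # 29
    0x1E: "ffi",   # 30
    0x8A: "Ł",     # 138
    0xFF: "ß",     # 255
}

def apply_replacements(text: str) -> str:
    """Apply all single-character replacements in one pass."""
    return text.translate(REPLACEMENTS)

print(apply_replacements("di\x1eculty, e\x1bect, \x1cnite"))
# prints "difficulty, effect, finite"
```

Using str.translate applies every replacement in a single pass, so the output of one substitution is never fed into another.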

Furthermore, some of these files use non-standard TeX notation. I don't know if that can be dealt with (the character in quotes is what is present in the file):
"%" 	arrownortheast should probably be \nearrow (TeX):
↗ 	NORTH EAST ARROW

",→"	arrowhookleft + arrow should probably be \hookrightarrow (TeX):
↪ 	RIGHTWARDS ARROW WITH HOOK

"*" 	arrowrighttophalf should probably be \rightharpoonup (TeX):
⇀ 	RIGHTWARDS HARPOON WITH BARB UPWARDS

And finally:
f prime should be extracted as f′ rather than f ′ (no space)
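
The spacing fix for the prime can be sketched as a one-line post-processing step in Python. It assumes the prime is extracted as U+2032 and that any space or tab directly before it is unwanted; the name attach_primes is hypothetical.

```python
import re

PRIME = "\u2032"  # the prime character, assumed to be U+2032

def attach_primes(text: str) -> str:
    """Remove spaces or tabs that directly precede a prime."""
    return re.sub(r"[ \t]+" + PRIME, PRIME, text)

print(attach_primes("f \u2032(x) = g\u2032(x)"))
```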


[jira] Updated: (PDFBOX-728) Text extracted from a TeX-created PDF file comes in some form of hex encoding

Posted by "Andreas Lehmkühler (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PDFBOX-728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler updated PDFBOX-728:
--------------------------------------

    Attachment: wias_preprints_1401_r944875.txt


[jira] Reopened: (PDFBOX-728) Text extracted from a TeX-created PDF file comes in some form of hex encoding

Posted by "Thomas Fischer (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PDFBOX-728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Fischer reopened PDFBOX-728:
-----------------------------------


See last comment.
