Posted to dev@pdfbox.apache.org by "Brian Carrier (JIRA)" <ji...@apache.org> on 2008/09/19 22:04:44 UTC

[jira] Created: (PDFBOX-377) Incorrect direction of extracted Arabic Text

Incorrect direction of extracted Arabic Text
--------------------------------------------

Key: PDFBOX-377
URL: https://issues.apache.org/jira/browse/PDFBOX-377
Project: PDFBox
Issue Type: Bug
Components: Text extraction
Affects Versions: 0.8.0-incubator
Reporter: Brian Carrier

Arabic text (like text in other right-to-left languages) is stored in PDF files in presentation (visual) order, which is the opposite of the logical order in which Arabic text is typically stored. In logical order the first byte belongs to the right-most character, but in the output of PDFBox the first byte always belongs to the left-most character.

Further, PDF files typically store the presentation forms of Arabic characters instead of the more general forms, for example U+FB50 instead of U+0671. Presentation forms are not supposed to appear in logically ordered text, but PDFBox does not normalize them out.

The attached patch solves both of these problems using the ICU4J library (http://www.icu-project.org/). It identifies the dominant text direction of each page and reverses the order of each line (only if any right to left text exists). It then normalizes the text to remove the presentation forms.

An example file is attached. Without the patch, the following is (incorrectly) produced:
Hello ﺪﻤﺤﻣ World.

With the patch, the following is (correctly) produced:
Hello محمد World.
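
The two steps described above (reverse the right-to-left text, then normalize away the presentation forms) can be illustrated with a minimal sketch. This is not the attached patch: it assumes the JDK's java.text.Normalizer (NFKC) in place of ICU4J, and the class and method names are hypothetical. It reverses only unbroken runs of strong right-to-left characters, so it handles a single Arabic word such as the one in the example but not multi-word phrases, which need a full bidi algorithm.

```java
import java.text.Normalizer;

// Hypothetical illustration only; not the code from PDFTextStripper.diff.
final class VisualToLogicalSketch {

    // True for characters whose Unicode bidi class is R or AL.
    private static boolean isStrongRtl(int codePoint) {
        byte d = Character.getDirectionality(codePoint);
        return d == Character.DIRECTIONALITY_RIGHT_TO_LEFT
            || d == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC;
    }

    // Reverses each maximal run of strong right-to-left characters in place.
    static String reverseRtlRuns(String visual) {
        StringBuilder out = new StringBuilder(visual.length());
        StringBuilder run = new StringBuilder();
        for (int i = 0; i < visual.length(); ) {
            int cp = visual.codePointAt(i);
            i += Character.charCount(cp);
            if (isStrongRtl(cp)) {
                run.appendCodePoint(cp);
            } else {
                out.append(run.reverse());
                run.setLength(0);
                out.appendCodePoint(cp);
            }
        }
        return out.append(run.reverse()).toString();
    }

    // Reorders first, then applies compatibility normalization (NFKC),
    // which maps Arabic presentation forms to their general forms.
    static String toLogical(String visualLine) {
        return Normalizer.normalize(reverseRtlRuns(visualLine), Normalizer.Form.NFKC);
    }
}
```

Reordering happens before normalization because NFKC can expand one presentation-form ligature into several characters in logical order, and reversing afterwards would scramble them.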

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Re: [jira] Commented: (PDFBOX-377) Incorrect direction of extracted Arabic Text

Posted by Brian Carrier <ca...@digital-evidence.org>.

On Nov 21, 2008, at 8:00 AM, Jukka Zitting (JIRA) wrote:

>
>     [ https://issues.apache.org/jira/browse/PDFBOX-377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12649664#action_12649664 ]
>
> Jukka Zitting commented on PDFBOX-377:
> --------------------------------------
>
> Yeah, I guess using ICU4J is reasonable in this case. Could the  
> code be organized so that ICU4J is only loaded when such  
> normalization is needed?

I changed the previous patch to make ICU4J an optional dependency.
In the current design, the library is required to build PDFBox, but
at runtime it is used only if ICU4J is on the classpath. This
would probably be easier to manage if Maven were used instead of
Ant, because ICU4J could then be marked as an optional dependency.
Is time the only reason that PDFBox does not use Maven, or are there
other reasons?

thanks,
brian

[jira] Commented: (PDFBOX-377) Incorrect direction of extracted Arabic Text

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12649664#action_12649664 ] 

Jukka Zitting commented on PDFBOX-377:
--------------------------------------

Yeah, I guess using ICU4J is reasonable in this case. Could the code be organized so that ICU4J is only loaded when such normalization is needed?

[jira] Updated: (PDFBOX-377) Incorrect direction of extracted Arabic Text

Posted by "Brian Carrier (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PDFBOX-377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Brian Carrier updated PDFBOX-377:
---------------------------------

    Attachment: PDFTextStripper.diff
                hello3.pdf

Example file and diff against trunk.

[jira] Commented: (PDFBOX-377) Incorrect direction of extracted Arabic Text

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12648468#action_12648468 ] 

Jukka Zitting commented on PDFBOX-377:
--------------------------------------

Is there any reasonable way to achieve this without the ICU4J dependency?

There's nothing wrong with ICU4J (it's a solid piece of work with a nice license), but it's a pretty large library and having it as a mandatory dependency (with this patch the PDFTextStripper class would not even load without ICU4J in the classpath) might be troublesome for some users.


[jira] Commented: (PDFBOX-377) Incorrect direction of extracted Arabic Text

Posted by "Brian Carrier (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12648638#action_12648638 ] 

Brian Carrier commented on PDFBOX-377:
--------------------------------------

There aren't any other libraries that we know of that do the Unicode normalization and reverse bidi.  An alternative to cut down on space would be to see if the licenses allow us to extract only the relevant functions. 

FYI, this patch may no longer drop in directly after the updated patch to PDFBOX-374.  I can submit a new patch after 374 is committed.

[jira] Commented: (PDFBOX-377) Incorrect direction of extracted Arabic Text

Posted by "Brian Carrier (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653529#action_12653529 ] 

Brian Carrier commented on PDFBOX-377:
--------------------------------------

I have redone this patch to make ICU4J optional at runtime and to take into account other code changes that have been made since the original patch. ICU4J is needed to build PDFBox, but the code tests for the relevant classes at runtime to determine whether the ICU4J components should be used.  Updated Ant and Maven files are included in the diff.
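
A runtime test of this kind can be sketched roughly as follows. The thread does not say which classes the patch actually probes, so the use of com.ibm.icu.text.Bidi here is an assumption, and the class and method names are hypothetical.

```java
// Hypothetical illustration of an optional-dependency check; not the patch itself.
final class OptionalIcuSketch {

    // Returns true if the named class can be found on the current classpath.
    // The class is located but not initialized.
    static boolean isClassAvailable(String className) {
        try {
            Class.forName(className, false, OptionalIcuSketch.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    // Evaluated once; callers branch on this flag before touching ICU4J code.
    // The probed class name is an assumption for the sake of the example.
    static final boolean ICU4J_PRESENT = isClassAvailable("com.ibm.icu.text.Bidi");
}
```

For this to work, every direct reference to ICU4J types would typically live in a separate helper class that is loaded only when the flag is true, so that the main class still loads when the jar is absent.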

While working on this patch, I realized that PDFTextStripper and its subclasses were not consistent in how they wrote to the output (some called output directly and others used wrapper functions).  I made this more consistent and added some more wrappers (such as writeString). I then realized that some of the functions were not consistently named.  For example, processLineSeparator() was not really processing a line separator in the PDF file (in the way that processPage() processes a page). It prints a line separator, so I renamed it to writeLineSeparator() and deprecated the original. There were a few other functions that I found to be inconsistently named, so I renamed them as well and added deprecated wrappers for backwards compatibility.

For example:
- PDFStreamEngine.showCharacter() was renamed to processTextPosition() because it doesn't always show a character and it is in the hierarchy of processXXX() functions that include processPages(), processPage(), etc. 
- Similarly, PDFStreamEngine.showString() was renamed to processEncodedText() because a) it doesn't display anything and b) it takes encoded data as input (not a normal string). 
- PDFTextStripper.flushText() was renamed to writePage() because it is the writing counterpart to processPage() and it operates at the page scale, versus document scale. 

I migrated these renames to the classes that use them to remove the deprecation warnings.
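
The rename-plus-deprecated-wrapper pattern described above looks roughly like this. The method names writeString, writeLineSeparator and processLineSeparator come from this comment; the class itself, its StringBuilder output and the fixed "\n" separator are simplifying assumptions and do not reflect the real PDFTextStripper.

```java
// Hypothetical illustration of the pattern; not the real PDFTextStripper.
class StripperSketch {
    private final StringBuilder output = new StringBuilder();

    // Single wrapper through which all text reaches the output.
    protected void writeString(String text) {
        output.append(text);
    }

    // New, consistently named method: it writes a separator to the output.
    protected void writeLineSeparator() {
        writeString("\n");
    }

    /**
     * Old name kept for backwards compatibility.
     * @deprecated use {@link #writeLineSeparator()} instead
     */
    @Deprecated
    protected void processLineSeparator() {
        writeLineSeparator();
    }

    String text() {
        return output.toString();
    }
}
```

Subclasses that still call or override the old name keep compiling, with a deprecation warning pointing them at the new one.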

There are three failures in the regression tests. All three are improvements.
- The 10101-AR.pdf output now has more correct Arabic text in it.  It is better still if sorting is enabled, but the tests do not use sorting.
- The cweb.pdf and Garcia2004_thesis.pdf outputs are now both better because the 'ff' ligature has been replaced with "f" and "f".


[jira] Resolved: (PDFBOX-377) Incorrect direction of extracted Arabic Text

Posted by "Brian Carrier (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PDFBOX-377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Brian Carrier resolved PDFBOX-377.
----------------------------------

       Resolution: Fixed
    Fix Version/s: 0.8.0-incubator
         Assignee: Brian Carrier

Patch checked into trunk revision 734151.

[jira] Updated: (PDFBOX-377) Incorrect direction of extracted Arabic Text

Posted by "Brian Carrier (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PDFBOX-377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Brian Carrier updated PDFBOX-377:
---------------------------------

    Attachment: reorder-patch.zip

Updated patch (with ICU jar file)
