Posted to dev@pdfbox.apache.org by "Andreas Meier (JIRA)" <ji...@apache.org> on 2015/09/23 08:22:04 UTC

[jira] [Commented] (PDFBOX-2252) PDFTextStripper has problem with documents with mixed language directions

    [ https://issues.apache.org/jira/browse/PDFBOX-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14904040#comment-14904040 ] 

Andreas Meier commented on PDFBOX-2252:
---------------------------------------

I want to provide a new patch to address this problem. Unfortunately, the layout/format of my code differs from that of the PDFBox svn trunk. I have already added and activated pdfbox-eclipse-formatter.xml in my IDE, but I am still missing something.

Is there a cleanup configuration or some other formatting configuration I missed?

Example:

{code:title=PDFTextStripper.java|borderStyle=solid}
 /**
  * a list of regular expressions that match commonly used
  * list item formats, i.e. bullets, numbers, letters,
  * Roman numerals, etc. Not meant to be
  * comprehensive.
  */
private static final String[] LIST_ITEM_EXPRESSIONS = {
    "\\.",
    "\\d+\\.",
    "\\[\\d+\\]",
    "\\d+\\)",
    "[A-Z]\\.",
    "[a-z]\\.",
    "[A-Z]\\)",
    "[a-z]\\)",
    "[IVXL]+\\.",
    "[ivxl]+\\.",
};
{code}

is formatted to something like this:

{code:title=PDFTextStripper.java|borderStyle=solid}
 /**
  * a list of regular expressions that match commonly used list item formats, i.e. bullets, numbers, letters,
  * Roman numerals, etc. Not meant to be comprehensive.
  */
private static final String[] LIST_ITEM_EXPRESSIONS = { "\\.", "\\d+\\.", "\\[\\d+\\]",
    "\\d+\\)",  "[A-Z]\\.",  "[a-z]\\.",  "[A-Z]\\)",  "[a-z]\\)",  "[IVXL]+\\.",   "[ivxl]+\\.", };
{code}

> PDFTextStripper has problem with documents with mixed language directions
> -------------------------------------------------------------------------
>
>                 Key: PDFBOX-2252
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2252
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.6, 2.0.0
>            Reporter: Amir
>            Priority: Critical
>             Fix For: 2.1.0
>
>         Attachments: PDFTextStripper.java.patch, atest.pdf, overlap.jpg, test.pdf, wikipedia_dl_lyric_test.pdf
>
>
> When the input document of PDFTextStripper combines right-to-left and left-to-right languages, the output characters of one language are reversed.
> A sample bilingual PDF document is attached.
> PDFTextStripper has a variable "isRtlDominant" in the "writePage" function, which is defined as follows:     boolean isRtlDominant = rtlCount > ltrCount;
> This class counts the number of RTL characters and uses that count to decide whether the whole content should be reversed. That is not correct: the decision must be made for each word, not for the whole document.
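
For illustration only, here is a minimal sketch of the per-word decision the issue description asks for. It is not the PDFBox implementation or the attached patch. The class and method names (WordDirectionSketch, isRtlWord, normalizeLine) are hypothetical, and it assumes that RTL words arrive in visual (reversed) order and that words are separated by single spaces. It applies the same rtlCount > ltrCount comparison, but to each word instead of the whole page, and it does not attempt to reorder runs of words.

```java
public class WordDirectionSketch {

    /** Returns true if the word contains more RTL than LTR characters. */
    static boolean isRtlWord(String word) {
        int rtlCount = 0;
        int ltrCount = 0;
        for (int i = 0; i < word.length(); ) {
            int cp = word.codePointAt(i);
            byte dir = Character.getDirectionality(cp);
            if (dir == Character.DIRECTIONALITY_RIGHT_TO_LEFT
                    || dir == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC) {
                rtlCount++;
            } else if (dir == Character.DIRECTIONALITY_LEFT_TO_RIGHT) {
                ltrCount++;
            }
            // Digits, punctuation and other neutral characters are not counted.
            i += Character.charCount(cp);
        }
        return rtlCount > ltrCount;
    }

    /**
     * Reverses only the words judged to be RTL and leaves all other words
     * untouched. Assumption: RTL words are stored in visual order.
     */
    static String normalizeLine(String line) {
        String[] words = line.split(" ");
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < words.length; i++) {
            if (i > 0) {
                out.append(' ');
            }
            String word = words[i];
            if (isRtlWord(word)) {
                out.append(new StringBuilder(word).reverse());
            } else {
                out.append(word);
            }
        }
        return out.toString();
    }
}
```

With this approach, a line that mixes a Latin word and a Hebrew word has only the Hebrew word reversed, whereas a single page-wide isRtlDominant flag would reverse either everything or nothing.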


