Posted to dev@pdfbox.apache.org by "Masaki Komedani (Jira)" <ji...@apache.org> on 2023/02/22 06:18:00 UTC
[jira] [Comment Edited] (PDFBOX-4531) Extraction of Arabic PDF has incorrect ordering of normalized ligatures

    [ https://issues.apache.org/jira/browse/PDFBOX-4531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691948#comment-17691948 ] 

Masaki Komedani edited comment on PDFBOX-4531 at 2/22/23 6:17 AM:
------------------------------------------------------------------

[~tilman] I made the following two test documents, which include Arabic and Hebrew text.

[^bidi-ligature-1.pdf] has the ligatures (U+FB30) and (U+FEFB) for Issue 1

> 1. When there is a single code point representing multiple letters, like 'ﬁ' (U+FB01) in English or 'ﻻ' (U+FEFB) in Arabic, the decomposed letters should be expanded in visual order (instead of logical order). In other words, before the fix, words like "final" or "office" were extracted as "ifnal" and "oiffce".
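
The ordering problem in Issue 1 can be illustrated outside PDFBox with a small sketch. It assumes a right-to-left line is held in visual order and reversed to logical order afterwards. The names LigatureOrderDemo, expandVisualLine and toLogical are hypothetical, and this is a toy model rather than PDFBox's actual code; only java.text.Normalizer is a real API here.

```java
import java.text.Normalizer;

public class LigatureOrderDemo {
    // Compatibility-decompose a string; the resulting letters are in logical order.
    static String decompose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKD);
    }

    // Expand each char of a visual-order RTL line (BMP characters only).
    // If reverseExpansion is true, each decomposition is emitted in visual order.
    static String expandVisualLine(String visual, boolean reverseExpansion) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < visual.length(); i++) {
            String d = decompose(String.valueOf(visual.charAt(i)));
            if (reverseExpansion) {
                d = new StringBuilder(d).reverse().toString();
            }
            sb.append(d);
        }
        return sb.toString();
    }

    // Toy stand-in for the bidi step: reverse a purely RTL visual line.
    static String toLogical(String visual) {
        return new StringBuilder(visual).reverse().toString();
    }

    public static void main(String[] args) {
        // Visual order (left to right): qaf, alef, feh, teh, lam-alef ligature, alef
        String visual = "\u0642\u0627\u0641\u062A\uFEFB\u0627";
        System.out.println(toLogical(expandVisualLine(visual, false))); // logical-order expansion
        System.out.println(toLogical(expandVisualLine(visual, true)));  // visual-order expansion
    }
}
```

In this model, expanding the ligature in logical order produces the alef, alef, lam sequence seen in the 2.0.15 output quoted in the issue description, while expanding it in visual order produces the expected alef, lam, alef.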

[^bidi-ligature-2.pdf] has the ligatures (U+05D0)(U+05BC) and (U+0644)(U+0627) in decomposed form for Issue 2

> 2. The second type is when Arabic diacritics are stored with the letter in the same TextPosition. By analogy, the letter Á with acute (U+00C1) should be expanded to the letter A (U+0041) followed by ◌́ (U+0301). Without my fix, it will be expanded to the acute accent followed by the letter A.
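
The expected order for Issue 2 (base letter first, combining mark second) is what standard Unicode canonical decomposition produces. The sketch below checks this with java.text.Normalizer; the class DiacriticOrderDemo and its helpers are hypothetical names for illustration and do not reflect how PDFBox handles a TextPosition internally.

```java
import java.text.Normalizer;

public class DiacriticOrderDemo {
    // Canonical decomposition (NFD): base letter followed by its combining marks.
    static String expand(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    // True if the string is exactly one non-mark char followed by one non-spacing mark.
    static boolean baseThenMark(String expanded) {
        return expanded.length() == 2
            && Character.getType(expanded.charAt(0)) != Character.NON_SPACING_MARK
            && Character.getType(expanded.charAt(1)) == Character.NON_SPACING_MARK;
    }

    public static void main(String[] args) {
        // Latin A with acute, Arabic alef with hamza above, Hebrew alef with mapiq
        String[] samples = {"\u00C1", "\u0623", "\uFB30"};
        for (String s : samples) {
            String e = expand(s);
            System.out.printf("U+%04X -> U+%04X U+%04X, base first: %b%n",
                (int) s.charAt(0), (int) e.charAt(0), (int) e.charAt(1), baseThenMark(e));
        }
    }
}
```

A string such as U+0301 followed by U+0041, the order described as the faulty result, fails the baseThenMark check.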


was (Author: JIRAUSER298974):
[~tilman] I made the following two test documents, which include Arabic and Hebrew text.

[^bidi-ligature-1.pdf] has ligature (U+FB30) and (U+FEFB) for Issue 1

> 1. When there is a single code point representing multiple letters, like 'ﬁ' (U+FB01) in English or 'ﻻ' (U+FEFB) in Arabic, the decomposed letters should be expanded in visual order (instead of logical order). In other words, before the fix, words like "final" or "office" were extracted as "ifnal" and "oiffce".

[^bidi-ligature-2.pdf] has (U+05D0)(U+05BC) and (U+0644)(U+0627) for Issue 2

> 2. The second type is when Arabic diacritics are stored with the letter in the same TextPosition. By analogy, the letter Á with acute (U+00C1) should be expanded to the letter A (U+0041) followed by ◌́ (U+0301). Without my fix, it will be expanded to the acute accent followed by the letter A.

> Extraction of Arabic PDF has incorrect ordering of normalized ligatures
> -----------------------------------------------------------------------
>
>                 Key: PDFBOX-4531
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4531
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.15
>            Reporter: Tilman Hausherr
>            Priority: Major
>              Labels: Arabic, regression
>         Attachments: FES-GGArabisch-p112.pdf, PDFBOX-4531-reduced.pdf, PDFBOX-679-toobig.pdf, RAND_PE122z1.arabic.pdf, artikel1_20_arab.pdf, bidi-ligature-1.pdf, bidi-ligature-2.pdf, bidi-ligature.patch, diff-output.zip
>
>
> As reported by Elias Peterson in the mailing list:
> {quote}
> I think I'm seeing some issues concerning the handling of the Arabic lam-with-alef ligature.  I'm attempting to process the PDF here:
> https://www.rand.org/content/dam/rand/pubs/perspectives/PE100/PE122/RAND_PE122z1.arabic.pdf
> When I run the ExtractText command with 2.0.15 I get the following:
> $ java -jar pdfbox-app-2.0.15.jar ExtractText -encoding UTF-8 RAND_PE122z1.arabic.pdf output.txt
> $ head output.txt
> C O R P O R A T I O N
> منظور تحليلي
> رؤى خبير بشأن قضايا السياسات اآلنية
> االتفاق مع إيران
> األيام التي تلي
> ...
> The issue is with the last two lines in the above snippet, where my understanding is that the ligature لا was normalized but the two letters that compose it are in the wrong order.  I was thinking that PDFBOX-684 sounded similar, and running the same PDF through 1.8.16 I see the ligature is normalized in the way I think is expected (although the interspersed English-language words are backwards here).
> $ java -jar pdfbox-app-1.8.16.jar ExtractText -encoding UTF-8 RAND_PE122z1.arabic.pdf output.txt
> ...
> $ head output.txt
> N O I T A R O P R O C
> منظور تحليلي
> رؤى خبير بشأن قضايا السياسات الآنية
> الاتفاق مع إيران
> الأيام التي تلي
> ...
> {quote}


