Posted to dev@pdfbox.apache.org by "Tilman Hausherr (JIRA)" <ji...@apache.org> on 2019/04/27 15:52:00 UTC

[jira] [Comment Edited] (PDFBOX-4189) Enable PDF creation with Indian languages, by reading and utilizing the GSUB table

    [ https://issues.apache.org/jira/browse/PDFBOX-4189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16827647#comment-16827647 ] 

Tilman Hausherr edited comment on PDFBOX-4189 at 4/27/19 3:51 PM:
------------------------------------------------------------------

It has been a year, so I wanted to see what is going on, and I concentrated on the first word ( আমি ).

example2 has an incorrect visual glyph sequence but correct text extraction.

example3 has a correct visual glyph sequence but incorrect text extraction.

The "scythe" ি  (U+09BF, "BENGALI VOWEL SIGN I") is painted to the left of the consonant it is "influencing", but when typed in an editor, it comes after that consonant.
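
To make the ordering difference concrete, here is a minimal sketch of the logical-to-visual reordering for the first word. It is not how PDFBox or any shaping engine does it: the class and method names are made up for illustration, and it assumes the vowel sign follows a single consonant (real shaping also has to handle conjuncts and the other pre-base vowel signs).

```java
import java.util.Arrays;

public class VowelSignReorder {
    // U+09BF BENGALI VOWEL SIGN I, the "scythe"
    static final int VOWEL_SIGN_I = 0x09BF;

    /**
     * Simplified assumption: every U+09BF is swapped with the single
     * code point stored directly before it. Conjuncts are not handled.
     */
    static int[] toVisualOrder(int[] logical) {
        int[] visual = Arrays.copyOf(logical, logical.length);
        for (int i = 1; i < visual.length; i++) {
            if (visual[i] == VOWEL_SIGN_I) {
                visual[i] = visual[i - 1];
                visual[i - 1] = VOWEL_SIGN_I;
            }
        }
        return visual;
    }

    public static void main(String[] args) {
        // Logical (typed) order of the first word: AA, MA, VOWEL SIGN I
        int[] logical = "\u0986\u09AE\u09BF".codePoints().toArray();
        // Prints the painted order: U+0986, U+09BF, U+09AE
        for (int cp : toVisualOrder(logical)) {
            System.out.printf("U+%04X%n", cp);
        }
    }
}
```

A content stream that paints glyphs in the second order looks right, but a text extractor that simply reads glyphs in painted order gets the vowel sign before the consonant.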

Word solves this by mapping the "scythe" glyph to the consonant in the ToUnicode table: [^bengali-word-lohit-good.pdf] 

This looked suspicious, so I wondered what would happen if I used the "scythe" glyph with two different consonants. The result was [^bengali-word-lohit-bad.pdf]: the glyphs look good, but the text extraction is wrong 🤣. That is really funny, but the downside is that, for now, we have no "gold standard" to look up to.
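
One way to picture why the swapped mapping breaks down is the toy model below. It is only a sketch under assumptions: the glyph IDs are hypothetical, the ToUnicode table is modelled as a plain map with one entry per glyph, and I am assuming the consonant glyph is mapped to the vowel sign in return. I have not checked that this is exactly what the Word files contain.

```java
import java.util.HashMap;
import java.util.Map;

public class ToUnicodeSwapDemo {
    // Hypothetical glyph IDs, not taken from the Lohit font
    static final int G_AA = 1, G_MA = 2, G_SIGN_I = 3, G_KA = 4;

    /** Toy ToUnicode table using the swap trick (assumed, see lead-in). */
    static Map<Integer, String> swappedToUnicode() {
        Map<Integer, String> m = new HashMap<>();
        m.put(G_AA, "\u0986");     // AA glyph -> AA
        m.put(G_SIGN_I, "\u09AE"); // "scythe" glyph -> consonant MA
        m.put(G_MA, "\u09BF");     // MA glyph -> VOWEL SIGN I
        m.put(G_KA, "\u09BF");     // KA glyph -> VOWEL SIGN I
        return m;
    }

    /** Naive extraction: look up each glyph in painted order. */
    static String extract(int[] glyphs, Map<Integer, String> toUnicode) {
        StringBuilder sb = new StringBuilder();
        for (int g : glyphs) {
            sb.append(toUnicode.get(g));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<Integer, String> table = swappedToUnicode();
        // Painted order AA, scythe, MA extracts as AA, MA, SIGN I: correct
        System.out.println(extract(new int[] {G_AA, G_SIGN_I, G_MA}, table));
        // Painted order scythe, KA extracts as MA, SIGN I: wrong consonant
        System.out.println(extract(new int[] {G_SIGN_I, G_KA}, table));
    }
}
```

Because the single "scythe" glyph can only have one entry, it can stand in for only one consonant, so the trick gives correct extraction for one word and the wrong consonant as soon as a second consonant uses the same glyph.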


> Enable PDF creation with Indian languages, by reading and utilizing the GSUB table
> ----------------------------------------------------------------------------------
>
>                 Key: PDFBOX-4189
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4189
>             Project: PDFBox
>          Issue Type: New Feature
>          Components: FontBox, PDModel
>            Reporter: Palash Ray
>            Priority: Major
>         Attachments: Bengali-text-after.pdf, Bengali-text-before.pdf, BengaliPdfGenerationHelloWorld.java, bengali-example.pdf, bengali-example2.pdf, bengali-example3.pdf, bengali-word-lohit-bad.pdf, bengali-word-lohit-good.pdf, committed.patch, screenshot.png
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Implemented proper rendering of Indian languages, which need extensive Glyph substitution. The GSUB table has been read and used effectively to replace some compound words with their respective Glyphs. All tests are passing. I have tested this for the Bengali font. Please review these changes and let me know if it makes sense to incorporate these.


