Posted to users@pdfbox.apache.org by John Hewson <jo...@jahewson.com> on 2016/08/02 06:45:54 UTC
Re: TextPosition.getIndividualWidths() returns array with less items than expected

> On 20 Jul 2016, at 14:33, Ygor Mutti <yg...@jusbrasil.com.br> wrote:
> 
> IMHO, the responsibilities are messed up in this case.
> 
> I'm surprised to find out that Unicode deals with typographic sugar like
> ligatures. This could be much more conveniently handled by the font using
> separate glyphs.

Yes, indeed. Unicode includes only a handful of ligatures, and they exist for backwards compatibility with legacy systems.

> Also, I think only text search algorithms, not PDF authoring tools, should
> be concerned with searches using approximations. We already have to deal with
> PDF authors that don't approximate uncommon glyphs, so we have to handle
> them during text search anyway.

I think you might be after the "compatibility decomposition" defined by Unicode.
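
As a rough illustration of what compatibility decomposition does, here is a minimal sketch using java.text.Normalizer from the JDK. The class name LigatureFolding and the method fold are made up for this example and are not part of PDFBox.

```java
import java.text.Normalizer;

final class LigatureFolding {
    // Hypothetical helper: applies Unicode compatibility decomposition (NFKD),
    // which replaces compatibility characters such as U+FB01 with their
    // plain-letter equivalents.
    static String fold(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKD);
    }

    public static void main(String[] args) {
        String withLigature = "Justi\uFB01cation"; // contains U+FB01, the "fi" ligature
        System.out.println(fold(withLigature));    // prints "Justification"
    }
}
```

Note that NFKD also splits accented letters into a base letter plus a combining mark, so for text like "Justificação" you may prefer Normalizer.Form.NFKC, which recomposes the accents after folding the ligature.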

> I've solved the problem by determining the width of each character in the
> Unicode string as the width of the ligature divided by the length of the
> string. This is adequate for our purposes.

That's the approach that is used for placing a caret inside a ligature, so it's a decent choice.
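
For reference, the even split described above could look something like this. It is only a sketch: LigatureWidths and split are hypothetical names, and the glyph width is assumed to be whatever single value you already obtained for the ligature.

```java
import java.util.Arrays;

final class LigatureWidths {
    // Hypothetical helper: spreads one glyph's width evenly over the
    // characters of the Unicode string it decodes to.
    static float[] split(String unicode, float glyphWidth) {
        int n = unicode.length(); // counts UTF-16 code units
        float[] widths = new float[n];
        if (n > 0) {
            Arrays.fill(widths, glyphWidth / n);
        }
        return widths;
    }

    public static void main(String[] args) {
        // One "fi" glyph of width 10 becomes two characters of width 5 each.
        System.out.println(Arrays.toString(split("fi", 10f))); // prints [5.0, 5.0]
    }
}
```

Because String.length() counts UTF-16 code units, text outside the Basic Multilingual Plane would need codePointCount instead; for Latin ligatures this makes no difference.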

-- John

> Thank you, Tilman and John, for the help!
> 
>> On Tue, Jul 19, 2016 at 6:48 PM John Hewson <jo...@jahewson.com> wrote:
>> 
>> 
>>> On 19 Jul 2016, at 14:28, Tilman Hausherr <TH...@t-online.de> wrote:
>>> 
>>> On 19.07.2016 at 23:09, Ygor Mutti wrote:
>>>> Yes, it helps. Thank you for the prompt answer!
>>>> 
>>>> I wonder why the string returned by getUnicode contains the separate chars
>>>> instead of the ligature. Is there some way I can configure PDFTextStripper
>>>> to decode it as it is in the PDF?
>>> 
>>> No, I don't know.
>>> 
>>> The reason it is decoded this way is the CMap table, which looks like
>>> this and tells what to do with the codes in the PDF:
>> 
>> You mean the ToUnicode CMap (that’s what’s below). The CMap is found in
>> the Encoding entry and maps a character code to a CID.
>> 
>>> 
>>> /CIDInit /ProcSet findresource begin
>>> 12 dict begin
>>> begincmap
>>> /CIDSystemInfo
>>> << /Registry (Adobe)
>>> /Ordering (UCS) /Supplement 0 >> def
>>> /CMapName /Adobe-Identity-UCS def
>>> /CMapType 2 def
>>> 1 begincodespacerange
>>> <0000> <FFFF>
>>> endcodespacerange
>>> 100 beginbfchar
>>> <1D> <0066006C>   <============ fl
>>> <1E> <2212>
>>> <1F> <00660069>    <=========== fi
>>> (...)
>>> 
>>> 1F = octal 037 decodes to 00660069, i.e. two Unicode characters, f and i.
>>> 
>>> Think about it... if it decoded to the single "fi" ligature character, you
>>> wouldn't be able to text-search for "Justificação" easily in the extracted
>>> text.
>> 
>> Indeed. The ToUnicode CMap in this PDF specifies that the “fi” glyph
>> represents “f” and “i” in Unicode.
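
To make the mapping concrete, here is a minimal sketch of how a bfchar destination value such as <00660069> turns into text: the hex digits are read as UTF-16BE code units. The names BfCharDemo and decode are invented for this illustration; this is not PDFBox's actual CMap parser.

```java
import java.nio.charset.StandardCharsets;

final class BfCharDemo {
    // Hypothetical helper: interprets a bfchar destination value (hex digits
    // without the angle brackets) as UTF-16BE text.
    static String decode(String hex) {
        byte[] bytes = new byte[hex.length() / 2];
        for (int i = 0; i < bytes.length; i++) {
            bytes[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return new String(bytes, StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        // The octal escape \037 in the content stream is character code 0x1F.
        System.out.println(Integer.toHexString(037)); // prints "1f"
        System.out.println(decode("00660069"));       // prints "fi"
        System.out.println(decode("0066006C"));       // prints "fl"
    }
}
```

So a single character code in the content stream can legitimately yield a two-character Unicode string, which is why the string length and the number of widths can differ.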
>> 
>> — John
>> 
>>> Tilman
>>> 
>>> 
>>>> 
>>>> On Tue, Jul 19, 2016 at 4:47 PM Tilman Hausherr <TH...@t-online.de>
>>>> wrote:
>>>> 
>>>>>> On 19.07.2016 at 20:43, Ygor Mutti wrote:
>>>>>> Hi!
>>>>>> 
>>>>>> The javadoc states that the TextPosition.getIndividualWidths() method
>>>>>> returns "An array that is the same length as the length of the string."
>>>>>> Here is a gist containing a test case in which this statement is false:
>>>>>> https://gist.github.com/ygormutti/d40a80d425d552151625a063fb29c9ca
>>>>> I'd say the javadoc is wrong. It is the length of the CharacterCodes
>>>>> array, not the length of the Unicode string. The "fi" in Justificação is
>>>>> one glyph, a ligature.
>>>>> 
>>>>> This is the content stream:
>>>>> 
>>>>> [ (J) 20 (usti\037ca\347\343o) ] TJ
>>>>> 
>>>>> Does this explanation help?
>>>>> 
>>>>> Tilman
>>>>> 
>>>>>> It prints a line for two cases where TextPosition.getUnicode() returns
>>>>>> "fi" while at the same time TextPosition.getIndividualWidths() returns an
>>>>>> array containing a single float.
>>>>>> 
>>>>>> I've tried to pin down the version in which this behavior was
>>>>>> introduced and found that it works as expected in the 1.2.1 release and
>>>>>> has not since 1.3.0.
>>>>>> 
>>>>>> Should I open a ticket for this?
>>>>> 
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>>>>> For additional commands, e-mail: users-help@pdfbox.apache.org