Posted to users@pdfbox.apache.org by Luiz Marcelo Modesto <lm...@gmail.com> on 2024/03/13 19:03:13 UTC

Type 0 font - Text extraction X PDF Debugger

Hi everyone,

    I'm not sure if this is the same as FAQ "How come I am getting
gibberish(G38G43G36G51G5) when extracting text?"...

    I'm using PDFBox version 3.0.1 and OpenJDK Runtime Environment (build
11.0.22+7-post-Ubuntu-0ubuntu222.04.1).

    I'm trying to understand how this PDF chunk (from p4_fix.pdf attached)

  BT
  /G1F7 6.0 Tf
  94.871 773.806 Td
  <004200430044> Tj
  ET

    becomes "BCD" in PDFBox Debugger (and likewise in qpdfview, Adobe Reader,
Chrome, ...) but becomes "abc" in the PDFBox text extraction tool.

    Poppler's pdftotext (version 22.02.0) gives me "BCD" too.

    The renderers that allow me to copy the text also give me "BCD".

    It seems that the PDFBox extraction tool follows item "9.10.2 Mapping
character codes to Unicode values" (ISO 32000-2:2020), but all the others
take a different approach.

     Could you help me understand whether the problem is in the PDF
file, in the renderers, or in the text extraction tool?

Thank you!

Re: Type 0 font - Text extraction X PDF Debugger

Posted by Luiz Marcelo Modesto <lm...@gmail.com>.

After reading a lot of documentation again, I've changed my mind about what
I wrote before.

1)  "It's only shorter than the one I could have if I write several blocks
of beginbfchar/endbfchar."

begincidrange/endcidrange is a short form for several
begincidchar/endcidchar blocks.

beginbfrange/endbfrange is the correct short form for several
beginbfchar/endbfchar blocks.

2) "I'm sorry if I misunderstood, but this is a valid CMap too (it seems a
kind of Identity mapping too, except for the 0x00...), isn't it?"

It could be a valid CMap, but not for text extraction purposes.

Item 9.10.3 is clear about when a CMap serves this purpose:

"It shall use the beginbfchar, endbfchar, beginbfrange, and endbfrange
operators to define the mapping from character codes to Unicode character
sequences expressed in UTF-16BE encoding."

3) "If I've looked at the correct CMap file
(fontbox/target/classes/org/apache/fontbox/cmap/Identity-H) it also has a
lot of blocks of beginbfchar/endbfchar. It doesn't have any
beginbfchar/endbfchar block."

The file has a lot of begincidrange/endcidrange blocks.

In fact, it doesn't have any beginbfchar/endbfchar block (which conflicts
with item 9.10.3...).
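
To illustrate point 1, here is a minimal Java sketch (not PDFBox code; the
class and method names are made up for illustration). It expands a single
bfrange entry into the per-code entries that separate bfchar lines would
have to spell out. It assumes the simple form of bfrange, with one starting
destination value that is incremented once per code, and ignores the array
form.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of PDFBox: expands one bfrange entry
// "<lo> <hi> <dstStart>" into the per-code mappings bfchar would need.
class BfRangeExpansion {
    static Map<Integer, String> expand(int lo, int hi, int dstStart) {
        Map<Integer, String> mapping = new LinkedHashMap<>();
        for (int code = lo; code <= hi; code++) {
            // Each code maps to dstStart plus its offset within the range.
            int codePoint = dstStart + (code - lo);
            mapping.put(code, new String(Character.toChars(codePoint)));
        }
        return mapping;
    }

    public static void main(String[] args) {
        // "<0042> <0044> <0042>" stands for three separate bfchar lines.
        expand(0x0042, 0x0044, 0x0042).forEach((code, text) ->
                System.out.printf("<%04X> -> %s%n", code, text));
    }
}
```

Run as is, this prints one line per code (0042 to B, 0043 to C, 0044 to D),
which is what three individual bfchar entries would express.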

About debugging the text extraction tool:

1) Identity resolution uses this coding pattern in PDFont.java to obtain
the Unicode value:

new String(new char[] { (char) code })

Something similar can be found in LegacyPDFStreamEngine.java.
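
A minimal, self-contained sketch of that identity pattern, applied to the
hex string from the original question. The class name is made up, and the
assumption that every code is exactly two bytes wide is mine; in PDFBox the
font's CMap decides how the string is split into codes.

```java
// Hypothetical illustration, not PDFBox code.
class IdentityDecode {
    // Splits a PDF hex string such as "004200430044" into two-byte codes
    // and maps each code to the UTF-16 char with the same numeric value.
    static String decode(String hex) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i + 4 <= hex.length(); i += 4) {
            int code = Integer.parseInt(hex.substring(i, i + 4), 16);
            out.append(new String(new char[] { (char) code }));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("004200430044")); // prints "BCD"
    }
}
```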

My final thoughts:

1) Thank you Tilman for your help!

2) I think the tools that extract the "BCD" text could be partially
ignoring the CMap (because it is invalid for text extraction: it doesn't
contain beginbfchar/endbfchar or beginbfrange/endbfrange). So maybe they
don't try the five steps (letters "a" to "e") from item 9.10.2. Maybe they
choose the "identity" transformation when no Unicode value can be produced...

"If these methods fail to produce a Unicode value, there is no way to
determine what the character code represents in which case a PDF processor
may choose a character code of their choosing."

3) I don't have any suggestions for a code change that could be a good
solution. I may have to extract text from several thousand PDFs like
"pag4_alt.pdf". In that case, I'll change the code along the lines of the
file "identityForBadToUnicodeCMap.patch", which I've dropped into the shared
folder.
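
The fallback idea from point 2 could be sketched roughly like this. It is
not the patch mentioned in point 3 and not PDFBox code: the class, the
method names and the plain string scan for operators are all assumptions
made for illustration. As Tilman points out in this thread, falling back to
identity gives wrong results for other files, so this is an experiment
rather than a fix.

```java
import java.util.Map;

// Hypothetical sketch, not PDFBox code.
class ToUnicodeFallback {
    // True if the CMap source uses the operators that item 9.10.3 requires
    // for a ToUnicode CMap.
    static boolean hasBfOperators(String cmapSource) {
        return cmapSource.contains("beginbfchar")
                || cmapSource.contains("beginbfrange");
    }

    // Uses the parsed bf mappings when the CMap is usable; otherwise falls
    // back to treating the code itself as a UTF-16 value (identity).
    static String toUnicode(int code, String cmapSource,
                            Map<Integer, String> bfMappings) {
        if (hasBfOperators(cmapSource)) {
            return bfMappings.get(code); // null if the code is unmapped
        }
        return String.valueOf((char) code);
    }

    public static void main(String[] args) {
        String cmap = "2 begincidrange\n<0001> <00FF> 1\n"
                + "<0100> <FFFF> 256\nendcidrange\n";
        StringBuilder text = new StringBuilder();
        for (int code : new int[] { 0x0042, 0x0043, 0x0044 }) {
            text.append(toUnicode(code, cmap, Map.of()));
        }
        System.out.println(text); // prints "BCD"
    }
}
```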






On Fri, Mar 15, 2024 at 10:54, Tilman Hausherr <TH...@t-online.de>
wrote:

> Yes, identity does work for that file. However, using that logic fails to
> provide the correct results for other files with an unusable /ToUnicode
> stream.
>
> Yes, there can be larger blocks.
>
> My suspicion is that the tools that use "identity" for your file will
> fail for some of the files, unless we discover yet another tweak of that
> workaround algorithm that works with all of them.
>
> Tilman
>
> On 15.03.2024 14:28, Luiz Marcelo Modesto wrote:
> > Thank you Tilman!
> >
> > I'll try to read ISO 32000-2:2020 again to look for some kind of
> > precedence rules regarding the way of decoding string codes to Unicode
> > chars.
> >
> > My impression is that there are some choices, but I don't remember
> > whether anything is stated definitively. Maybe it is just an
> > implementation choice.
> >
> > I'll try to debug the text extraction tool to verify why using the
> > predefined Identity CMap works.
> >
> > If I've looked at the correct CMap file
> > (fontbox/target/classes/org/apache/fontbox/cmap/Identity-H) it also has a
> > lot of blocks of beginbfchar/endbfchar. It doesn't have any
> > beginbfchar/endbfchar block.
> >
> > All the blocks have their length limited to 256 codes, but it seems
> > PDFBox can support larger blocks. But maybe the set "<0100> <FFFF> 256"
> > could be a problem...
> >
> > PS.: The use of "true" was just a quick and dirty way to run a fast test,
> > as the beginbfchar/endbfchar block suggested an identity mapping to me.
> >
> > On Fri, Mar 15, 2024 at 01:35, Tilman Hausherr <THausherr@t-online.de>
> > wrote:
> >
> >> You are correct that it's the "fb" parts that are missing. (And some of
> >> the other tools you tried also mention this)
> >>
> >> Just adding true results in text extraction of several files no longer
> >> being correct, 433525-p1.pdf O226ORR4SMIKRGPWC6PXUYAYMSBB6FVP-p3.pdf
> >> PDFBOX-5540.pdf R4EXG25W532JHDQLJAM4HF6O532TLR7D-p1.pdf
> >>
> >> Adding  "&& !cmap.hasCIDMappings()" after "hasUnicodeMappings()" brings
> >> no regressions but your text is not extracted properly.
> >>
> >> Maybe it is possible to include yet another rule for your file, but
> >> there's likely more to do and there is the risk that other files no
> >> longer extract properly.
> >>
> >> Tilman
> >>
> >> On 15.03.2024 00:08, Luiz Marcelo Modesto wrote:
> >>> It seems that PDFBOX-5540 resolves a special case based on some
> >>> dictionary properties and chooses a predefined CMap (Identity CMap).
> >>>
> >>> Reading the PDFont.java code, I think the warning "Invalid ToUnicode
> >>> CMap in font AvenirNextLTPro-Cn" comes from the fact that the CMap
> >>> stream doesn't contain any beginbfchar/endbfchar blocks.
> >>>
> >>> The CMap's two HashMaps (charToUnicodeOneByte and
> >>> charToUnicodeTwoBytes) are indeed empty.
> >>>
> >>> But the font CMap stream contains this block:
> >>>
> >>> 2 begincidrange
> >>> <0001> <00FF> 1
> >>> <0100> <FFFF> 256
> >>> endcidrange
> >>>
> >>> I'm sorry if I misunderstood, but this is a valid CMap too (it seems a
> >>> kind of Identity mapping too, except for the 0x00...), isn't it?
> >>>
> >>> It's only shorter than the one I could have if I write several blocks
> >>> of beginbfchar/endbfchar.
> >>>
> >>> If I make this "dumb" modification (adding "true" to the conditions)
> >>> just for a quick test
> >>>
> >>> if (cmapName.contains("Identity") //
> >>>         || ordering.contains("Identity") //
> >>>         || COSName.IDENTITY_H.equals(encoding) //
> >>>         || COSName.IDENTITY_V.equals(encoding) || true)
> >>> {
> >>>     COSDictionary encodingDict = dict.getCOSDictionary(COSName.ENCODING);
> >>>     if (true || encodingDict == null
> >>>             || !encodingDict.containsKey(COSName.DIFFERENCES))
> >>>     {
> >>>         // assume that if encoding is identity, then the reverse is also true
> >>>         cmap = CMapManager.getPredefinedCMap(COSName.IDENTITY_H.getName());
> >>>         LOG.warn("Using predefined identity CMap instead");
> >>>     }
> >>> }
> >>>
> >>> I get the "BCD" string, like all the other tools:
> >>>
> >>> The encoding parameter is ignored when writing to the console.
> >>> mar 14, 2024 7:30:27 PM org.apache.pdfbox.pdmodel.font.PDFont
> >>> loadUnicodeCmap
> >>> ADVERTÊNCIA: Invalid ToUnicode CMap in font AvenirNextLTPro-Cn
> >>> mar 14, 2024 7:31:00 PM org.apache.pdfbox.pdmodel.font.PDFont
> >>> loadUnicodeCmap
> >>> ADVERTÊNCIA: Using predefined identity CMap instead
> >>> Página 4 de 4
> >>> Informações:  BCD
> >>>
> >>> Maybe the text extraction tool should be using the
> >>> begincidrange/endcidrange information...
> >>>
> >>> What do you think?
> >>>
> >>> PS.: I've read some parts of ISO 32000-2:2020, but it is quite long.
> >>> Maybe I'm missing something... I'm sorry if this is the case...
> >>>
> >>> On Thu, Mar 14, 2024 at 10:30, Luiz Marcelo Modesto <
> >>> lmodesto.work@gmail.com> wrote:
> >>>
> >>>> Ok!
> >>>>
> >>>> I'll read PDFBOX-5540 and related issues.
> >>>>
> >>>> Thank you very much!
> >>>>
> >>>>
> >>>> On Thu, Mar 14, 2024 at 10:08, Tilman Hausherr <THausherr@t-online.de>
> >>>> wrote:
> >>>>
> >>>>> Hi,
> >>>>>
> >>>>> The problem is in the ToUnicode stream; there's a log message
> >>>>> "Invalid ToUnicode CMap in font AvenirNextLTPro-Cn". It has no
> >>>>> Unicode mappings. PDFBox is trying a fallback solution which turns
> >>>>> out to be wrong. This is related to PDFBOX-5540 and earlier related
> >>>>> issues.
> >>>>>
> >>>>> Tilman
> >>>>>
> >>>>> On 14.03.2024 13:28, Luiz Marcelo Modesto wrote:
> >>>>>> Hi Tilman!
> >>>>>>
> >>>>>>        Thank you very much for your attention!
> >>>>>>
> >>>>>>        You can find the file "p4_alt.pdf" in this folder:
> >>>>>> <https://drive.google.com/drive/folders/1AjiwYdDEHVEn4h7e53PosIf_QAk6BDoN?usp=sharing>
> >>>>>> The "Extra infos.pdf" file shows some output from PDF Debugger and
> >>>>>> other tools.
> >>>>>>
> >>>>>>        I'm sorry, I sent the PDF file as an attachment in my first
> >>>>>> message, but I didn't know that it wouldn't work.
> >>>>>>
> >>>>>> On Thu, Mar 14, 2024 at 07:16, Tilman Hausherr <THausherr@t-online.de>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> please upload your file to a sharehoster.
> >>>>>>>
> >>>>>>> Tilman
> >>>>>>>
> >> ---------------------------------------------------------------------
> >>>>>>>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
> >>>>>>>> For additional commands, e-mail: users-help@pdfbox.apache.org

Re: Type 0 font - Text extraction X PDF Debugger

Posted by Tilman Hausherr <TH...@t-online.de>.

Yes, identity does work for that file. However, using that logic fails to
provide the correct results for other files with an unusable /ToUnicode
stream.

Yes, there can be larger blocks.

My suspicion is that the tools that use "identity" for your file will
fail for some of the files, unless we discover yet another tweak of that
workaround algorithm that works with all of them.

Tilman


Re: Type 0 font - Text extraction X PDF Debugger

Posted by Luiz Marcelo Modesto <lm...@gmail.com>.

Thank you Tilman!

I'll try to read ISO 32000-2:2020 again to look for some kind of precedence
rules regarding the way of decoding string codes to Unicode chars.

My impression is that there are some choices, but I don't remember whether
anything is stated definitively. Maybe it is just an implementation
choice.

I'll try to debug the text extraction tool to verify why using the
predefined Identity CMap works.

If I've looked at the correct CMap file
(fontbox/target/classes/org/apache/fontbox/cmap/Identity-H) it also has a
lot of blocks of beginbfchar/endbfchar. It doesn't have any
beginbfchar/endbfchar block.

All the blocks have their length limited to 256 codes, but it seems PDFBox
can support larger blocks. But, maybe the set "<0100> <FFFF> 256" could be
a problem...
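
For reference, a small sketch of how the two begincidrange entries in this
font's CMap stream (<0001> <00FF> 1 and <0100> <FFFF> 256) map codes to
CIDs. This is not the FontBox parser; the class name and the hard-coded
table are assumptions for illustration only.

```java
// Hypothetical illustration, not FontBox code.
class CidRangeLookup {
    // Each row is one "<lo> <hi> cid" entry: { lo, hi, first CID }.
    static final int[][] RANGES = {
        { 0x0001, 0x00FF, 1 },
        { 0x0100, 0xFFFF, 256 },
    };

    // Returns the CID for a code, or -1 if no range covers the code.
    static int toCid(int code) {
        for (int[] range : RANGES) {
            if (code >= range[0] && code <= range[1]) {
                return range[2] + (code - range[0]);
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        for (int code : new int[] { 0x0000, 0x0042, 0x0100, 0xFFFF }) {
            System.out.printf("code 0x%04X -> CID %d%n", code, toCid(code));
        }
    }
}
```

With these two entries every code from 0x0001 to 0xFFFF maps to a CID with
the same numeric value, and only 0x0000 is left unmapped. Note that this is
a code-to-CID mapping, not a mapping to Unicode.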

PS.: The use of "true" was just a quick and dirty way to run a fast test, as
the beginbfchar/endbfchar block suggested an identity mapping to me.





Re: Type 0 font - Text extraction X PDF Debugger

Posted by Andreas Lehmkühler <an...@lehmi.de.INVALID>.


On 25.03.24 at 10:07, Tilman Hausherr wrote:
> On 25.03.2024 07:48, Andreas Lehmkühler wrote:
>> Thanks for the URLs. All of them are working with my change.
>>
>> See https://issues.apache.org/jira/browse/PDFBOX-5790 for further 
>> details.
>>
>> @Tilman Please run your tests if possible
> 
> No regressions 👍
Cool, thanks for the retest
> 
> Tilman
> 
> 
> 
>>
>> Andreas
>>
>> On 24.03.24 at 16:39, Tilman Hausherr wrote:
>>> Here they are, remove the XXX
>>>
>>> https://corpora.tika.apache.org/XXXbase/docs/govdocs1/433/433525.pdf
>>> https://corpora.tika.apache.org/XXXbase/docs/commoncrawl3/O2/O226ORR4SMIKRGPWC6PXUYAYMSBB6FVP
>>> https://corpora.tika.apache.org/XXXbase/docs/commoncrawl3/R4/R4EXG25W532JHDQLJAM4HF6O532TLR7D
>>>
>>> The extension p1 / p3 means I split these files and used only one 
>>> page for my own tests.
>>>
>>> Tilman
>>>
>>>
>>> On 24.03.2024 16:19, Andreas Lehmkühler wrote:
>>>>
>>>>
>>>> On 15.03.24 at 05:35, Tilman Hausherr wrote:
>>>>> You are correct that it's the "fb" parts that are missing. (And 
>>>>> some of the other tools you tried also mention this)
>>>>>
>>>>> Just adding true results in text extraction of several files no 
>>>>> longer being correct, 433525-p1.pdf 
>>>>> O226ORR4SMIKRGPWC6PXUYAYMSBB6FVP-p3.pdf PDFBOX-5540.pdf 
>>>>> R4EXG25W532JHDQLJAM4HF6O532TLR7D-p1.pdf
>>>> I've found a solution which works with the provided PDF and with 
>>>> PDFBOX-5540.pdf.
>>>>
>>>> @Tilman I guess the other files are from our test corpus? If so, 
>>>> where exactly can I find them?
>>>>
>>>> Andreas
>>>>
>>>>>
>>>>> Adding  "&& !cmap.hasCIDMappings()" after "hasUnicodeMappings()" 
>>>>> brings no regressions but your text is not extracted properly.
>>>>>
>>>>> Maybe it is possible to include yet another rule for your file, but 
>>>>> there's likely more to do and there is the risk that other files no 
>>>>> longer extract properly.
>>>>>
>>>>> Tilman
>>>>>
>>>>> On 15.03.2024 00:08, Luiz Marcelo Modesto wrote:
>>>>>> It seems that PDFBOX-5540 resolves a special case based on some 
>>>>>> dictionary
>>>>>> properties and chooses a predefined CMap (Identity CMap).
>>>>>>
>>>>>> Reading the PDFont.java code, I think the warning "Invalid 
>>>>>> ToUnicode CMap
>>>>>> in font AvenirNextLTPro-Cn" comes from the fact that the CMap stream
>>>>>> doesn't contain 1 or more blocks of beginbfchar/endbfchar.
>>>>>>
>>>>>> The two CMap's HashMaps (charToUnicodeOneByte and 
>>>>>> charToUnicodeTwoBytes)
>>>>>> are really empty.
>>>>>>
>>>>>> But the font CMap stream contains this block:
>>>>>>
>>>>>> 2 begincidrange
>>>>>> <0001> <00FF> 1
>>>>>> <0100> <FFFF> 256
>>>>>> endcidrange
>>>>>>
>>>>>> I'm sorry if I misunderstood, but this is a valid CMap too (it 
>>>>>> seems a kind
>>>>>> of Identity mapping too, except for the 0x00...), isn't it?
>>>>>>
>>>>>> It's only shorter than the one I could have if I write several 
>>>>>> blocks of
>>>>>> beginbfchar/endbfchar.
>>>>>>
>>>>>> If I make this "dumb" modification (adding "true" to conditions) 
>>>>>> just for a
>>>>>> rapid test
>>>>>>
>>>>>> if (cmapName.contains("Identity") //
>>>>>>         || ordering.contains("Identity") //
>>>>>>         || COSName.IDENTITY_H.equals(encoding) //
>>>>>>         || COSName.IDENTITY_V.equals(encoding) || true)
>>>>>> {
>>>>>>     COSDictionary encodingDict = dict.getCOSDictionary(COSName.ENCODING);
>>>>>>     if (true || encodingDict == null || !encodingDict.containsKey(COSName.DIFFERENCES))
>>>>>>     {
>>>>>>         // assume that if encoding is identity, then the reverse is also true
>>>>>>         cmap = CMapManager.getPredefinedCMap(COSName.IDENTITY_H.getName());
>>>>>>         LOG.warn("Using predefined identity CMap instead");
>>>>>>     }
>>>>>> }
>>>>>>
>>>>>> I get the "BCD" string, like all the other tools.
>>>>>>
>>>>>> The encoding parameter is ignored when writing to the console.
>>>>>> mar 14, 2024 7:30:27 PM org.apache.pdfbox.pdmodel.font.PDFont
>>>>>> loadUnicodeCmap
>>>>>> ADVERTÊNCIA: Invalid ToUnicode CMap in font AvenirNextLTPro-Cn
>>>>>> mar 14, 2024 7:31:00 PM org.apache.pdfbox.pdmodel.font.PDFont
>>>>>> loadUnicodeCmap
>>>>>> ADVERTÊNCIA: Using predefined identity CMap instead
>>>>>> Página 4 de 4
>>>>>> Informações:  BCD
>>>>>>
>>>>>> Maybe the text extraction tool should be using the
>>>>>> begincidrange/endcidrange information...
>>>>>>
>>>>>> What do you think?
>>>>>>
>>>>>> PS.: I've read some pieces from ISO 32000-2:2020 but it is quite 
>>>>>> long.
>>>>>> Maybe I'm missing something... I'm sorry if this is the case...
>>>>>>
>>>>>> On Thu, 14 Mar 2024 at 10:30, Luiz Marcelo Modesto <
>>>>>> lmodesto.work@gmail.com> wrote:
>>>>>>
>>>>>>> Ok!
>>>>>>>
>>>>>>> I'll read PDFBOX-5540 and related issues.
>>>>>>>
>>>>>>> Thank you very much!
>>>>>>>
>>>>>>>
>>>>>>> On Thu, 14 Mar 2024 at 10:08, Tilman Hausherr 
>>>>>>> <TH...@t-online.de>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> The problem is in the ToUnicode stream, there's a log message 
>>>>>>>> "Invalid
>>>>>>>> ToUnicode CMap in font AvenirNextLTPro-Cn". It has no unicode 
>>>>>>>> mappings.
>>>>>>>> PDFBox is trying a fallback solution which turns out to be 
>>>>>>>> wrong. This
>>>>>>>> is related to PDFBOX-5540 and earlier related issues.
>>>>>>>>
>>>>>>>> Tilman
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 14.03.2024 13:28, Luiz Marcelo Modesto wrote:
>>>>>>>>> Hi Tilman!
>>>>>>>>>
>>>>>>>>>       Thank you very much for your attention!
>>>>>>>>>
>>>>>>>>>       You can find the file "p4_alt.pdf" in this folder
>>>>>>>>> <https://drive.google.com/drive/folders/1AjiwYdDEHVEn4h7e53PosIf_QAk6BDoN?usp=sharing>.
>>>>>>>>> "Extra infos.pdf" file shows some output from PDF Debugger and 
>>>>>>>>> others.
>>>>>>>>>
>>>>>>>>>       I'm sorry, I sent the pdf file as an attachment in my first
>>>>>>>>> message, but I didn't know that it wouldn't work.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, 14 Mar 2024 at 07:16, Tilman Hausherr <THausherr@t-online.de>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> please upload your file to a sharehoster.
>>>>>>>>>>
>>>>>>>>>> Tilman
>>>>>>>>>>

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org
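
Note on the begincidrange block quoted above: it maps character codes to
CIDs, and for every code except 0x0000 the CID equals the code. The Java
sketch below is only an illustration under stated assumptions. The names
CidRangeSketch, CidRange, twoByteCodes, lookup and decode are hypothetical
and are not PDFBox API. Treating a CID as a Unicode code point merely
mirrors the effect of the identity fallback tested in this thread; a
cidrange block does not define Unicode values by itself. The sketch splits
the Tj operand <004200430044> into two-byte codes and looks each one up in
the two ranges.

```java
import java.util.ArrayList;
import java.util.List;

public class CidRangeSketch {
    /** One line of a begincidrange block: codes lo..hi map to CIDs starting at firstCid. */
    public static final class CidRange {
        final int lo;
        final int hi;
        final int firstCid;

        CidRange(int lo, int hi, int firstCid) {
            this.lo = lo;
            this.hi = hi;
            this.firstCid = firstCid;
        }
    }

    /** The two ranges from the CMap stream quoted in the thread. */
    public static final List<CidRange> RANGES = List.of(
            new CidRange(0x0001, 0x00FF, 1),
            new CidRange(0x0100, 0xFFFF, 256));

    /** Splits a hex string such as "004200430044" into two-byte character codes. */
    public static List<Integer> twoByteCodes(String hex) {
        List<Integer> codes = new ArrayList<>();
        for (int i = 0; i + 4 <= hex.length(); i += 4) {
            codes.add(Integer.parseInt(hex.substring(i, i + 4), 16));
        }
        return codes;
    }

    /** Returns the CID for a code, or -1 if no range covers it. */
    public static int lookup(List<CidRange> ranges, int code) {
        for (CidRange r : ranges) {
            if (code >= r.lo && code <= r.hi) {
                return r.firstCid + (code - r.lo);
            }
        }
        return -1;
    }

    /** Assumption: each CID is read as a Unicode code point (identity-style). */
    public static String decode(String hex) {
        StringBuilder sb = new StringBuilder();
        for (int code : twoByteCodes(hex)) {
            int cid = lookup(RANGES, code);
            if (cid >= 0) {
                sb.appendCodePoint(cid);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("004200430044")); // prints BCD
    }
}
```

Code 0x0000 is covered by neither range, which is the "except for the
0x00" remark in the quoted message.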

Re: Type 0 font - Text extraction X PDF Debugger

Posted by Tilman Hausherr <TH...@t-online.de>.

On 25.03.2024 07:48, Andreas Lehmkühler wrote:
> Thanks for the URLs. All of them are working with my change.
>
> See https://issues.apache.org/jira/browse/PDFBOX-5790 for further 
> details.
>
> @Tilman Please run your tests if possible

No regressions 👍

Tilman
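
Note on the guard discussed earlier in this thread (adding
"&& !cmap.hasCIDMappings()" after "hasUnicodeMappings()"): the Java sketch
below shows only the boolean logic of that suggestion. It is not the
PDFBox source and not the change made for PDFBOX-5790. CMapInfo, Decision,
of, withoutGuard and withGuard are hypothetical names; only the two query
method names are taken from the thread.

```java
public class ToUnicodeGuardSketch {
    /** Hypothetical stand-in for the two CMap queries named in the thread. */
    public interface CMapInfo {
        boolean hasUnicodeMappings();
        boolean hasCIDMappings();
    }

    public enum Decision { KEEP_TOUNICODE_CMAP, TRY_IDENTITY_FALLBACK }

    /** Builds a CMapInfo with fixed answers, for illustration only. */
    public static CMapInfo of(boolean unicode, boolean cid) {
        return new CMapInfo() {
            @Override public boolean hasUnicodeMappings() { return unicode; }
            @Override public boolean hasCIDMappings() { return cid; }
        };
    }

    /** Without the guard: a CMap lacking Unicode mappings triggers the fallback path. */
    public static Decision withoutGuard(CMapInfo cmap) {
        return !cmap.hasUnicodeMappings()
                ? Decision.TRY_IDENTITY_FALLBACK
                : Decision.KEEP_TOUNICODE_CMAP;
    }

    /** With the guard: the fallback path is taken only if CID mappings are missing too. */
    public static Decision withGuard(CMapInfo cmap) {
        return (!cmap.hasUnicodeMappings() && !cmap.hasCIDMappings())
                ? Decision.TRY_IDENTITY_FALLBACK
                : Decision.KEEP_TOUNICODE_CMAP;
    }

    public static void main(String[] args) {
        // A CMap like the one reported: CID mappings present, no Unicode mappings.
        CMapInfo reported = of(false, true);
        System.out.println(withoutGuard(reported)); // prints TRY_IDENTITY_FALLBACK
        System.out.println(withGuard(reported));    // prints KEEP_TOUNICODE_CMAP
    }
}
```

For a CMap with CID mappings but no Unicode mappings, the guarded variant
keeps the ToUnicode CMap. That is consistent with the remark that the text
is then not extracted properly, since such a CMap carries no Unicode
mappings to extract with.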




Re: Type 0 font - Text extraction X PDF Debugger

Posted by Andreas Lehmkühler <an...@lehmi.de.INVALID>.

Thanks for the URLs. All of them are working with my change.

See https://issues.apache.org/jira/browse/PDFBOX-5790 for further details.

@Tilman Please run your tests if possible

Andreas


Re: Type 0 font - Text extraction X PDF Debugger

Posted by Tilman Hausherr <TH...@t-online.de>.

Here they are, remove the XXX

https://corpora.tika.apache.org/XXXbase/docs/govdocs1/433/433525.pdf
https://corpora.tika.apache.org/XXXbase/docs/commoncrawl3/O2/O226ORR4SMIKRGPWC6PXUYAYMSBB6FVP
https://corpora.tika.apache.org/XXXbase/docs/commoncrawl3/R4/R4EXG25W532JHDQLJAM4HF6O532TLR7D

The extension p1 / p3 means I split these files and used only one page 
for my own tests.

Tilman



Re: Type 0 font - Text extraction X PDF Debugger

Posted by Andreas Lehmkühler <an...@lehmi.de.INVALID>.


On 15.03.24 at 05:35, Tilman Hausherr wrote:
> You are correct that it's the "bf" parts that are missing. (Some of
> the other tools you tried also mention this.)
>
> Just adding "true" causes text extraction to be incorrect for several
> files: 433525-p1.pdf, O226ORR4SMIKRGPWC6PXUYAYMSBB6FVP-p3.pdf,
> PDFBOX-5540.pdf and R4EXG25W532JHDQLJAM4HF6O532TLR7D-p1.pdf.
I've found a solution which works with the provided PDF and with
PDFBOX-5540.pdf.

@Tilman I guess the other files are from our test corpus? If so, where
exactly can I find them?

Andreas

>
> Adding "&& !cmap.hasCIDMappings()" after "hasUnicodeMappings()" brings
> no regressions, but your text is still not extracted properly.
>
> Maybe it is possible to include yet another rule for your file, but
> there's likely more to do, and there is a risk that other files will no
> longer extract properly.
>
> Tilman

Re: Type 0 font - Text extraction X PDF Debugger

Posted by Tilman Hausherr <TH...@t-online.de>.

You are correct that it's the "bf" parts that are missing. (Some of
the other tools you tried also mention this.)

Just adding "true" causes text extraction to be incorrect for several
files: 433525-p1.pdf, O226ORR4SMIKRGPWC6PXUYAYMSBB6FVP-p3.pdf,
PDFBOX-5540.pdf and R4EXG25W532JHDQLJAM4HF6O532TLR7D-p1.pdf.

Adding "&& !cmap.hasCIDMappings()" after "hasUnicodeMappings()" brings
no regressions, but your text is still not extracted properly.

Maybe it is possible to include yet another rule for your file, but
there's likely more to do, and there is a risk that other files will no
longer extract properly.

Tilman
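
To make the effect of the extra condition concrete, here is a small
stand-alone sketch. It is not PDFBox code: it assumes the existing check
is "!cmap.hasUnicodeMappings()" and models only the two boolean inputs.
The class and method names are hypothetical.

```java
/** Stand-alone sketch of the two rules discussed above; not PDFBox code. */
class FallbackRuleSketch {

    /** Assumed current rule: a ToUnicode CMap without Unicode mappings is treated as invalid. */
    static boolean treatedAsInvalid(boolean hasUnicodeMappings) {
        return !hasUnicodeMappings;
    }

    /** Rule with the extra "&& !cmap.hasCIDMappings()" condition appended. */
    static boolean treatedAsInvalidWithCidCheck(boolean hasUnicodeMappings, boolean hasCidMappings) {
        return !hasUnicodeMappings && !hasCidMappings;
    }

    public static void main(String[] args) {
        boolean[] values = {false, true};
        // Print the outcome of both rules for every combination of inputs.
        for (boolean unicode : values) {
            for (boolean cid : values) {
                System.out.println("unicode=" + unicode + " cid=" + cid
                        + " -> current=" + treatedAsInvalid(unicode)
                        + " withCidCheck=" + treatedAsInvalidWithCidCheck(unicode, cid));
            }
        }
    }
}
```

The two rules differ only for a CMap that has no Unicode mappings but does
have CID mappings, which is presumably the case for the file in this
thread: the first rule enters the fallback branch, the second skips it.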

On 15.03.2024 00:08, Luiz Marcelo Modesto wrote:
> It seems that PDFBOX-5540 resolves a special case based on some dictionary
> properties and chooses a predefined CMap (Identity CMap).
>
> Reading the PDFont.java code, I think the warning "Invalid ToUnicode CMap
> in font AvenirNextLTPro-Cn" comes from the fact that the CMap stream
> doesn't contain 1 or more blocks of beginbfchar/endbfchar.
>
> The two CMap's HashMaps (charToUnicodeOneByte and charToUnicodeTwoBytes)
> are really empty.
>
> But the font CMap stream contains this block:
>
> 2 begincidrange
> <0001> <00FF> 1
> <0100> <FFFF> 256
> endcidrange
>
> I'm sorry if I misunderstood, but this is a valid CMap too (it seems a kind
> of Identity mapping too, except for the 0x00...), isn't it?
>
> It's only shorter than the one I could have if I write several blocks of
> beginbfchar/endbfchar.
>
> If I make this "dumb" modification (adding "true" to conditions) just for a
> rapid test
>
> if (cmapName.contains("Identity") //
> || ordering.contains("Identity") //
> || COSName.IDENTITY_H.equals(encoding) //
> || COSName.IDENTITY_V.equals(encoding) || true)
> {
> COSDictionary encodingDict = dict.getCOSDictionary(COSName.ENCODING);
> if (true || encodingDict == null || !encodingDict.containsKey(COSName.
> DIFFERENCES))
> {
> // assume that if encoding is identity, then the reverse is also true
> cmap = CMapManager.getPredefinedCMap(COSName.IDENTITY_H.getName());
> LOG.warn("Using predefined identity CMap instead");
> }
> }
>
> I've got "BCD" string like all the others
>
> The encoding parameter is ignored when writing to the console.
> mar 14, 2024 7:30:27 PM org.apache.pdfbox.pdmodel.font.PDFont
> loadUnicodeCmap
> ADVERTÊNCIA: Invalid ToUnicode CMap in font AvenirNextLTPro-Cn
> mar 14, 2024 7:31:00 PM org.apache.pdfbox.pdmodel.font.PDFont
> loadUnicodeCmap
> ADVERTÊNCIA: Using predefined identity CMap instead
> Página 4 de 4
> Informações:  BCD
>
> Maybe the extract text tool should been using begincidrange/endcidrange
> information...
>
> What do you think about?
>
> PS.: I've read some pieces from ISO 32000-2:2020 but it is quite long.
> Maybe I'm missing something... I'm sorry if this is the case...
>
> Em qui., 14 de mar. de 2024 às 10:30, Luiz Marcelo Modesto <
> lmodesto.work@gmail.com> escreveu:
>
>> Ok!
>>
>> I'll read PDFBOX-5540 and related issues.
>>
>> Thank you very much!
>>
>>
>> Em qui, 14 de mar de 2024 10:08, Tilman Hausherr <TH...@t-online.de>
>> escreveu:
>>
>>> Hi,
>>>
>>> The problem is in the ToUnicode stream, there's a log message "Invalid
>>> ToUnicode CMap in font AvenirNextLTPro-Cn". It has no unicode mappings.
>>> PDFBox is trying a fallback solution which turns out to be wrong. This
>>> is related to PDFBOX-5540 and earlier related issues.
>>>
>>> Tilman
>>>
>>>
>>>
>>> On 14.03.2024 13:28, Luiz Marcelo Modesto wrote:
>>>> Hi Tilman!
>>>>
>>>>       Thank you very much for your attention!
>>>>
>>>>       You can find the file "p4_alt.pdf" in this folder
>>>> <
>>> https://drive.google.com/drive/folders/1AjiwYdDEHVEn4h7e53PosIf_QAk6BDoN?usp=sharing
>>>> .
>>>> "Extra infos.pdf" file shows some output from PDF Debugger and others.
>>>>
>>>>       I'm sorry, I sent the pdf file as an attachment in my first
>>> message,
>>>> but I didn't know that it wouldn't work.
>>>>
>>>>
>>>>
>>>> Em qui., 14 de mar. de 2024 às 07:16, Tilman Hausherr <
>>> THausherr@t-online.de>
>>>> escreveu:
>>>>
>>>>> Hi,
>>>>>
>>>>> please upload your file to a sharehoster.
>>>>>
>>>>> Tilman
>>>>>
>>>>> On 13.03.2024 20:03, Luiz Marcelo Modesto wrote:
>>>>>> Hi everyone,
>>>>>>
>>>>>>       I'm not sure if this is the same as FAQ "How come I am getting
>>>>>> gibberish(G38G43G36G51G5) when extracting text?"...
>>>>>>
>>>>>>       I'm using PDFBox version 3.0.1 and OpenJDK Runtime Environment
>>>>>> (build 11.0.22+7-post-Ubuntu-0ubuntu222.04.1).
>>>>>>
>>>>>>       I'm trying to understand how this PDF chunk (from p4_fix.pdf
>>>>> attached)
>>>>>>     BT
>>>>>>     /G1F7 6.0 Tf
>>>>>>     94.871 773.806 Td
>>>>>>     <004200430044> Tj
>>>>>>     ET
>>>>>>
>>>>>>       becomes "BCD" on PDFBox Debugger (the same on qpdfview, Adobe
>>>>>> Reader, Chrome, ...) and becomes "abc" on PDFBox text extraction tool.
>>>>>>
>>>>>>       Using the Poppler pdftotext (version 22.02.0) gives me "BCD" too.
>>>>>>
>>>>>>       The renders that allow me to copy the text give me "BCD" text.
>>>>>>
>>>>>>       It seems that PDFBox extraction tool follows the item "9.10.2
>>>>>> Mapping character codes to Unicode values" (ISO 32000-2:2020) but all
>>>>>> the others choose a different way.
>>>>>>
>>>>>>        Could you help me to understand if there is a problem with the
>>>>>> PDF file, with the renders or with the extract text tool?
>>>>>>
>>>>>> Thank you!
>>>>>>
>>>>>>
>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>>>>>> For additional commands, e-mail: users-help@pdfbox.apache.org
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>>>>> For additional commands, e-mail: users-help@pdfbox.apache.org
>>>>>
>>>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>>> For additional commands, e-mail: users-help@pdfbox.apache.org
>>>
>>>


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org

Re: Type 0 font - Text extraction X PDF Debugger

Posted by Luiz Marcelo Modesto <lm...@gmail.com>.

It seems that PDFBOX-5540 resolves a special case based on some dictionary
properties and chooses a predefined CMap (Identity CMap).

Reading the PDFont.java code, I think the warning "Invalid ToUnicode CMap
in font AvenirNextLTPro-Cn" comes from the fact that the CMap stream
doesn't contain any beginbfchar/endbfchar blocks.

The CMap's two HashMaps (charToUnicodeOneByte and charToUnicodeTwoBytes)
are indeed empty.

But the font CMap stream contains this block:

2 begincidrange
<0001> <00FF> 1
<0100> <FFFF> 256
endcidrange

I'm sorry if I misunderstood, but this is a valid CMap too (it seems a kind
of Identity mapping too, except for the 0x00...), isn't it?

It's only shorter than the one I could have if I write several blocks of
beginbfchar/endbfchar.
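
As a rough illustration of what those two begincidrange entries compute,
here is a stand-alone Java sketch. It is not PDFBox code, and all class
and method names are hypothetical. It maps two-byte character codes to
CIDs and then, purely as the "identity" assumption being tested in this
message, reads each CID as if it were a Unicode code point. A CID is not a
Unicode value in general, so this is only a sketch of the idea.

```java
import java.util.Arrays;
import java.util.List;

/** Stand-alone sketch; not PDFBox code. */
class CidRangeSketch {

    /** One "<lo> <hi> firstCid" entry of a begincidrange block. */
    static final class CidRange {
        final int lo;
        final int hi;
        final int firstCid;

        CidRange(int lo, int hi, int firstCid) {
            this.lo = lo;
            this.hi = hi;
            this.firstCid = firstCid;
        }

        boolean contains(int code) {
            return code >= lo && code <= hi;
        }

        int toCid(int code) {
            return firstCid + (code - lo);
        }
    }

    // The two entries from the CMap stream quoted above.
    static final List<CidRange> RANGES = Arrays.asList(
            new CidRange(0x0001, 0x00FF, 1),
            new CidRange(0x0100, 0xFFFF, 256));

    /** Maps a two-byte character code to a CID, or -1 if no range covers it. */
    static int toCid(int code) {
        for (CidRange range : RANGES) {
            if (range.contains(code)) {
                return range.toCid(code);
            }
        }
        return -1;
    }

    /** Splits a hex string operand such as "004200430044" into two-byte codes. */
    static int[] twoByteCodes(String hex) {
        int[] codes = new int[hex.length() / 4];
        for (int i = 0; i < codes.length; i++) {
            codes[i] = Integer.parseInt(hex.substring(4 * i, 4 * i + 4), 16);
        }
        return codes;
    }

    /** Identity assumption: reads each CID as if it were a Unicode code point. */
    static String decodeAssumingIdentity(String hex) {
        StringBuilder text = new StringBuilder();
        for (int code : twoByteCodes(hex)) {
            int cid = toCid(code);
            if (cid >= 0) {
                text.appendCodePoint(cid);
            }
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Operand of the Tj operator from the content stream in this thread.
        System.out.println(decodeAssumingIdentity("004200430044")); // prints "BCD"
    }
}
```

Both ranges map every code to a CID with the same numeric value, and code
0x0000 is covered by neither range.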

If I make this "dumb" modification (adding "true" to the conditions), just
as a quick test:

if (cmapName.contains("Identity") //
        || ordering.contains("Identity") //
        || COSName.IDENTITY_H.equals(encoding) //
        || COSName.IDENTITY_V.equals(encoding) || true)
{
    COSDictionary encodingDict = dict.getCOSDictionary(COSName.ENCODING);
    if (true || encodingDict == null || !encodingDict.containsKey(COSName.DIFFERENCES))
    {
        // assume that if encoding is identity, then the reverse is also true
        cmap = CMapManager.getPredefinedCMap(COSName.IDENTITY_H.getName());
        LOG.warn("Using predefined identity CMap instead");
    }
}

I get the "BCD" string, like all the other tools:

The encoding parameter is ignored when writing to the console.
mar 14, 2024 7:30:27 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
ADVERTÊNCIA: Invalid ToUnicode CMap in font AvenirNextLTPro-Cn
mar 14, 2024 7:31:00 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
ADVERTÊNCIA: Using predefined identity CMap instead
Página 4 de 4
Informações:  BCD

Maybe the text extraction tool should be using the
begincidrange/endcidrange information...

What do you think?

PS: I've read parts of ISO 32000-2:2020, but it is quite long.
Maybe I'm missing something... I'm sorry if that is the case.


Re: Type 0 font - Text extraction X PDF Debugger

Posted by Luiz Marcelo Modesto <lm...@gmail.com>.

Ok!

I'll read PDFBOX-5540 and related issues.

Thank you very much!



Re: Type 0 font - Text extraction X PDF Debugger

Posted by Tilman Hausherr <TH...@t-online.de>.

Hi,

The problem is in the ToUnicode stream; there's a log message "Invalid
ToUnicode CMap in font AvenirNextLTPro-Cn". It has no Unicode mappings.
PDFBox tries a fallback solution, which turns out to be wrong. This
is related to PDFBOX-5540 and earlier related issues.

Tilman
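
One rough way to see what "no Unicode mappings" means at the stream level
is to check which kinds of mapping sections the CMap text contains. The
sketch below is stand-alone Java, not PDFBox's CMap parser. The class and
method names are hypothetical, and the bfrange sample is made up for
comparison. The cidrange text is the block from the font's CMap stream
quoted elsewhere in this thread.

```java
import java.util.regex.Pattern;

/** Stand-alone sketch; this is not how PDFBox parses CMaps. */
class ToUnicodeSectionCheck {

    // bf sections carry code -> Unicode mappings.
    private static final Pattern BF_SECTION =
            Pattern.compile("\\bbeginbf(char|range)\\b");

    // cid sections carry code -> CID mappings.
    private static final Pattern CID_SECTION =
            Pattern.compile("\\bbegincid(char|range)\\b");

    static boolean hasBfSections(String cmapText) {
        return BF_SECTION.matcher(cmapText).find();
    }

    static boolean hasCidSections(String cmapText) {
        return CID_SECTION.matcher(cmapText).find();
    }

    public static void main(String[] args) {
        // Block from the CMap stream discussed in this thread.
        String fromThread = "2 begincidrange\n"
                + "<0001> <00FF> 1\n"
                + "<0100> <FFFF> 256\n"
                + "endcidrange\n";
        // Hypothetical bfrange block, for comparison only.
        String hypothetical = "1 beginbfrange\n"
                + "<0041> <0044> <0041>\n"
                + "endbfrange\n";

        System.out.println(hasBfSections(fromThread) + " " + hasCidSections(fromThread));
        System.out.println(hasBfSections(hypothetical) + " " + hasCidSections(hypothetical));
    }
}
```

The first line printed is "false true" and the second is "true false": the
stream from this thread has CID sections only, so there is nothing that
maps character codes to Unicode.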




Re: Type 0 font - Text extraction X PDF Debugger

Posted by Luiz Marcelo Modesto <lm...@gmail.com>.

Hi Tilman!

    Thank you very much for your attention!

    You can find the file "p4_alt.pdf" in this folder
<https://drive.google.com/drive/folders/1AjiwYdDEHVEn4h7e53PosIf_QAk6BDoN?usp=sharing>.
The "Extra infos.pdf" file shows some output from PDF Debugger and other
tools.

    I'm sorry, I sent the PDF file as an attachment in my first message;
I didn't know that it wouldn't work.




Re: Type 0 font - Text extraction X PDF Debugger

Posted by Tilman Hausherr <TH...@t-online.de>.

Hi,

Please upload your file to a file-sharing service.

Tilman
