Posted to users@pdfbox.apache.org by "OYEBISI, Daniel" <do...@bdoc.com> on 2016/06/22 14:58:46 UTC

Empty glyphs

Hello,

I came across an issue while trying to extract the text using PDFTextStripper from the PDF file attached to this email.
When I open the generated txt document in Notepad, it appears normal, but when I open it with Notepad++ it gives an interesting result.
Please can you have a look at this?

Thanks


Re: Empty glyphs

Posted by Olaf Drümmer <ol...@callassoftware.com>.
As I am enjoying being very anal at times…:

> On 25.06.2016, at 13:52, Tilman Hausherr <TH...@t-online.de> wrote:
> 
> Here's an excerpt the CMAP table

This probably would have to be called a
	CMap
which is to be distinguished from
	cmap
in TrueType/OpenType fonts.

To the best of my knowledge
	CMAP 
is not meaningful in the context of fonts in PDF.

Olaf


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org


Re: Empty glyphs

Posted by Tilman Hausherr <TH...@t-online.de>.
Here's an excerpt of the CMAP table of that font, to be found at 
Root/Pages/Kids/[0]/Resources/Font/F480/ToUnicode  :


/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo 3 dict dup begin
   /Registry (Adobe) def
   /Ordering (UCS) def
   /Supplement 0 def
end def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
1 beginbfchar
<0000> <ffff>
endbfchar
2 beginbfrange
<0001> <005f> <f020>
<0060> <00d0> <f080>
endbfrange
endcmap
CMapName currentdict /CMap defineresource pop
end
end



This means that character codes in the content stream whose value is between 
0001 and 005f are converted to Unicode starting at f020, and codes between 
0060 and 00d0 starting at f080 (see beginbfrange; search for this word in 
the PDF 32000 specification).
https://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/PDF32000_2008.pdf
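
The lookup described above can be sketched in a few lines of Python. This is only an illustration of how bfchar and bfrange entries are read, not how PDFBox implements it; the table values are copied from the excerpt, while the names BFCHAR, BFRANGE and to_unicode are made up for this example.

```python
# Hypothetical sketch: values come from the ToUnicode excerpt above,
# all names are illustrative and not part of any PDFBox API.
BFCHAR = {0x0000: 0xFFFF}          # <0000> <ffff>
BFRANGE = [
    (0x0001, 0x005F, 0xF020),      # <0001> <005f> <f020>
    (0x0060, 0x00D0, 0xF080),      # <0060> <00d0> <f080>
]


def to_unicode(code):
    """Return the Unicode code point for a 2-byte character code, or None."""
    if code in BFCHAR:
        return BFCHAR[code]
    for low, high, target in BFRANGE:
        if low <= code <= high:
            # A bfrange maps consecutive codes to consecutive code points.
            return target + (code - low)
    return None


for code in (0x0000, 0x0001, 0x005F, 0x0060, 0x00D0):
    print(f"{code:04x} -> {to_unicode(code):04x}")
```

So the first range ends at f07e and the second at f0f0, both inside the Private Use Area, which is where symbol fonts such as Wingdings are commonly mapped.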

But the content stream also has

     [ (\000\000) ] TJ

16 times. This is rendered as a square by Adobe and PDFBox. In the 
beginbfchar section, code 0000 is mapped to Unicode ffff (U+FFFF), which 
is a Unicode noncharacter. It becomes EF BF BF in UTF-8.

http://www.fileformat.info/info/unicode/char/ffff/index.htm
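
To check the byte sequence yourself, a minimal Python sketch follows. It shows that U+FFFF encodes to EF BF BF in UTF-8, and one possible way to drop that code point from already extracted text. The helper names utf8_hex and drop_ffff are assumptions for this example; whether dropping the character is appropriate depends on what you do with the text.

```python
def utf8_hex(ch):
    """Return the UTF-8 bytes of a string as space-separated hex."""
    return " ".join(f"{b:02X}" for b in ch.encode("utf-8"))


def drop_ffff(text):
    """Remove every U+FFFF noncharacter from the extracted text."""
    return text.replace("\uffff", "")


print(utf8_hex("\uffff"))               # EF BF BF
print(repr(drop_ffff("abc\uffffdef")))  # 'abcdef'
```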

QED

Tilman







RE: Empty glyphs

Posted by "OYEBISI, Daniel" <do...@bdoc.com>.
You can get the PDF file at this URL:

http://www.pdf-archive.com/2016/06/23/modele-tableau-wingdings-3/



Re: Empty glyphs

Posted by Tilman Hausherr <TH...@t-online.de>.
On 22.06.2016 at 22:50, Brzrk One wrote:
> isn't that the ByteOrderMark?

No, the UTF-8 byte order mark is EF BB BF (U+FEFF), whereas EF BF BF is U+FFFF:
https://stackoverflow.com/questions/10310210/is-ef-bf-bf-an-allowed-character-in-xml-utf-8
https://en.wikipedia.org/wiki/Byte_order_mark#UTF-8
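
A short Python sketch makes the difference visible. It decodes both three-byte sequences and prints the code point each one stands for. codecs.BOM_UTF8 is part of the standard library; the names SEEN and describe are made up for this example.

```python
import codecs

SEEN = bytes.fromhex("EFBFBF")   # the bytes found in the extracted text


def describe(seq):
    """Decode a UTF-8 sequence holding one character and name its code point."""
    ch = seq.decode("utf-8")
    return f"U+{ord(ch):04X}"


print(describe(codecs.BOM_UTF8))  # U+FEFF, the byte order mark (EF BB BF)
print(describe(SEEN))             # U+FFFF, a noncharacter (EF BF BF)
```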

Tilman




Re: Empty glyphs

Posted by Brzrk One <br...@gmail.com>.
isn't that the ByteOrderMark?


Re: Empty glyphs

Posted by Frank van der Hulst <dr...@gmail.com>.
Notepad++ has a hex display/edit plugin. You need to use that to see exactly what
text is "hidden".
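
If no hex plugin is at hand, a few lines of Python can do the same job. This sketch searches raw bytes for a given sequence and reports the offsets. The function name find_offsets and the sample data are assumptions for illustration; for a real file you would read it in binary mode, for example with open(path, "rb").

```python
def find_offsets(data, needle=b"\xef\xbf\xbf"):
    """Return the byte offsets of every non-overlapping match of needle."""
    offsets = []
    start = data.find(needle)
    while start != -1:
        offsets.append(start)
        start = data.find(needle, start + len(needle))
    return offsets


# Sample data standing in for the extracted text file.
sample = "A\uffffB\uffff".encode("utf-8")
print(" ".join(f"{b:02X}" for b in sample))  # 41 EF BF BF 42 EF BF BF
print(find_offsets(sample))                  # [1, 5]
```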



Re: Empty glyphs

Posted by Tilman Hausherr <TH...@t-online.de>.
From what I see, the "whitespace" characters are EF BF BF, which is the 
UTF-8 encoding of U+FFFF, a Unicode noncharacter rather than a real text 
character. Please upload the PDF file somewhere.

Tilman



RE: Empty glyphs

Posted by "OYEBISI, Daniel" <do...@bdoc.com>.
The problem is with some of the whitespace, which appears empty in Notepad but really is not.
Please try opening the text file with other text editors.
Thanks



Re: Empty glyphs

Posted by Tilman Hausherr <TH...@t-online.de>.
Your PDF didn't get through (security), but this sounds like a Notepad++ problem.

I could display your txt file with the normal Notepad by changing the 
font to Wingdings.

Tilman
