Posted to users@pdfbox.apache.org by "Brangs, Erik" <E....@dnb.de> on 2023/08/02 09:20:22 UTC

Suppressing warnings for missing Unicode mappings

Hi,

We're using PDFBox 3.0.0-beta1 to extract text from PDFs. This produces lots of warnings about missing Unicode mappings. Is there a programmatic way to suppress those messages, or would it be better to configure logging to do that?

If it's better to configure logging, I would try to configure the logging level for PDSimpleFont, PDType0Font, PDFont and GlyphList. Are those all the relevant loggers, or are there more?

For GlyphList, the most common warning is "Not a number in Unicode character name: unionsq". I also saw a warning "Not a number in Unicode character name: users" but only for one PDF.


Kind regards
Erik Brangs
*** Search. Find. Discover. Deutsche Nationalbibliothek ***

-- 
Erik Brangs
Deutsche Nationalbibliothek
Information Technology
Adickesallee 1
60322 Frankfurt am Main
Phone: +49 69 1525-1792
Fax: +49 69 1525-1799
mailto:e.brangs@dnb.de
http://www.dnb.de


Re: AW: Suppressing warnings for missing Unicode mappings

Posted by Tilman Hausherr <TH...@t-online.de>.
Thanks. It happens on page 15, at the square union symbol, near

     "the space obtained from M1 ∪ M2 by gluing along φ"

where a squared glyph is used instead of "∪". I then found this file:
https://github.com/kohler/lcdf-typetools/blob/master/texglyphlist.txt
so this glyph name is an extension for TeX.

Also:
https://gist.github.com/RAnders00/09b69031fb0cdd429ba1e3e75cce2898

The license is LGPL so we can't use it, but we don't have to think about 
it because Adobe also can't extract the text. Adobe also fails to 
extract the "∪".

I was also wondering what this squared union symbol is about, and it 
turned out that there are other such symbols and that there is no 
universal meaning.
https://math.stackexchange.com/questions/1929439/what-does-square-subset-and-square-union-symbol-mean
https://math.stackexchange.com/questions/1569400/does-sqsubset-have-any-special-meaning

I'll keep page 15 for my own text extraction tests to detect related 
code changes.

Tilman



On 03.08.2023 09:42, Brangs, Erik wrote:
> Hi,
>
> thank you.
>
> Here is a link to a PDF that shows the unionsq warning:
>
> https://d-nb.info/1267991550/34


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org


AW: Suppressing warnings for missing Unicode mappings

Posted by "Brangs, Erik" <E....@dnb.de>.
Hi,

thank you.

Here is a link to a PDF that shows the unionsq warning:

https://d-nb.info/1267991550/34



> -----Original Message-----
> From: Tilman Hausherr [mailto:THausherr@t-online.de]
> Sent: Wednesday, 2 August 2023 20:18
> To: users@pdfbox.apache.org
> Subject: Re: Suppressing warnings for missing Unicode mappings
> 
> Hi,
> 
> Yes, reducing the logging level is the way to go. I don't know whether
> there are more loggers.
> 
> I'd also be interested in the "unionsq" file; I wonder whether this is a
> false positive. This happens because "uniNNNN" is a valid glyph name.
> There are unionsqdisplay and unionsqtext too, but not unionsq.
> 
> Tilman


Re: Suppressing warnings for missing Unicode mappings

Posted by Tilman Hausherr <TH...@t-online.de>.
Hi,

Yes, reducing the logging level is the way to go. I don't know whether there are more loggers.
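
As a rough illustration of what reducing logging could look like in code, here is a minimal sketch. It rests on two assumptions that are not confirmed in this thread: that the logging facade ends up delegating to java.util.logging (i.e. no other backend such as Log4j is on the classpath), and that the loggers are named after the fully qualified class names of the four classes mentioned in the question. The class name QuietFontLogging and the method quiet() are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;

final class QuietFontLogging {
    // Assumption: logger names equal the fully qualified class names.
    static final List<String> LOGGER_NAMES = List.of(
        "org.apache.pdfbox.pdmodel.font.PDSimpleFont",
        "org.apache.pdfbox.pdmodel.font.PDType0Font",
        "org.apache.pdfbox.pdmodel.font.PDFont",
        "org.apache.pdfbox.pdmodel.font.encoding.GlyphList");

    // java.util.logging keeps only weak references to loggers, so we hold
    // strong references; otherwise the configured level could be lost.
    private static final List<Logger> PINNED = new ArrayList<>();

    /** Raises the threshold of the listed loggers so that warnings are dropped. */
    static void quiet() {
        for (String name : LOGGER_NAMES) {
            Logger logger = Logger.getLogger(name);
            logger.setLevel(Level.SEVERE); // WARNING and below are discarded
            PINNED.add(logger);
        }
    }

    public static void main(String[] args) {
        quiet();
        // ... run the text extraction here ...
    }
}
```

If a different logging backend is in use, the same idea would be expressed in that backend's configuration file instead, using the same logger names.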

I'd also be interested in the "unionsq" file; I wonder whether this is a 
false positive. This happens because "uniNNNN" is a valid glyph name. 
There are unionsqdisplay and unionsqtext too, but not unionsq.
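
To make the "uniNNNN" point concrete, here is a simplified sketch of how such a name could be decoded and why "unionsq" and "users" trip over it. This is not the actual PDFBox code; the class UniNameSketch, the method decode() and the exact length checks are assumptions chosen to match the two warnings reported in this thread. Under these assumptions, "unionsq" looks like "uni" plus four characters and "users" looks like "u" plus four characters, but neither remainder is a hexadecimal number.

```java
final class UniNameSketch {
    /**
     * Decodes names of the form uniXXXX or uXXXX (XXXX = four hex digits).
     * Returns null if the name does not have that shape or is not a number.
     */
    static String decode(String name) {
        try {
            if (name.startsWith("uni") && name.length() == 7) {
                int codePoint = Integer.parseInt(name.substring(3), 16);
                return new String(Character.toChars(codePoint));
            }
            if (name.startsWith("u") && name.length() == 5) {
                int codePoint = Integer.parseInt(name.substring(1), 16);
                return new String(Character.toChars(codePoint));
            }
        } catch (NumberFormatException e) {
            // The name has the right shape but the remainder is not hex.
            System.err.println("Not a number in Unicode character name: " + name);
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(decode("uni2294"));     // a real uniXXXX name
        System.out.println(decode("unionsq"));     // "onsq" is not hex
        System.out.println(decode("users"));       // "sers" is not hex
        System.out.println(decode("unionsqtext")); // wrong length, no warning
    }
}
```

In this sketch a longer name such as unionsqtext has neither length, so it is never parsed as a number and produces no warning.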

Tilman


