Posted to user@tika.apache.org by Giovanni De Stefano <gi...@servisoft.be> on 2019/04/01 21:26:32 UTC

No Unicode mapping for xx (xx) in font null

Hello,

I am having trouble extracting data from a bunch of PDFs.

The output I get is something like:


cd\pYe[Ŷd_z\ndYedYnspn\̀\ah\spYv\cnàdYỲaY€d̀̀[̀dYedcYcapnh\spcYaeo\p\ch_ah\zdcY̂€‚ƒmYr[\YaY_dt_\cYndcYnsotihdpndcw

„S……KMV†SNMWL…MHULIRL…M‡ˆ„MXY‰spY]YŠdY€̂‚YpfdchYnsotihdphYr[fdpYoah\‹_dYedY_do\cdYdhŒs[Y_ie[nh\spYedcYann_s\ccdodphcYef\otuhYdhYaodpedc

v\cnàdcYaeo\p\ch_ah\zdcYdpYoah\‹_dYef\otuhYc[_ỲdcY_dzdp[cmYedYhadcYacc\o\̀idcYa[Y\otuhcYc[_ỲdcY_dzdp[cYdhYedYe_s\hcYdhYhadcYe\zd_cdcwYŠdc

_do\cdcYdhŒs[Y_ie[nh\spcYefaodpedcYŽ‚YpdYcsphYespnYtacYz\cidcwYŠdYo\p\ch_dYedcYq\papndcYs[YcspYvspnh\sppa\_dYeìi[iY_dchdphYespnYnsotihdphc

a_hwY}Ya__ghiYe[Y_idphmYjkw|lwjkljƒYw

The logs inform me that many Unicode mappings are missing:


WARN  No Unicode mapping for 87 (87) in font null

WARN  No Unicode mapping for 88 (88) in font null

WARN  No Unicode mapping for .notdef (89) in font null

WARN  No Unicode mapping for 90 (90) in font null

WARN  No Unicode mapping for 91 (91) in font null

WARN  No Unicode mapping for 92 (92) in font null

I can reproduce this behavior with a vanilla Tika Server 1.20.

I attach the pdf here.

What could be wrong? Any idea on the steps I can take to properly extract metadata and body?

Thanks a lot,
Giovanni





Re: No Unicode mapping for xx (xx) in font null

Posted by Tilman Hausherr <TH...@t-online.de>.
On 04.04.2019 at 14:59, Tim Allison wrote:
>> One could create such a database by using "good" PDFs as source, or
>> (more simply) by just getting the original fonts and going through them.
> This has occurred to me, and I'm happy to hear that Peter has been
> making headway on this option.
>
> Some questions:
> 1) Where are the hooks in PDFBox to load an external font (or better,
> if sufficient: the codepoint mappings) gathered via this method?
> Would we need the full font, or could we inject only codepoint
> mappings (some spaces are calculated by character width/distance from
> last character, so we'd need the font info, right)?

I haven't understood all this (but it's late here). I also haven't 
understood Peter's text.

To get a TTF font in fontbox: new TTFParser().parse(). From there you 
can access the path, the unicode, etc. For a better understanding, load 
the font into DTL OTMaster 3.7 light and look at the tables. FontForge 
sucks IMHO, its gui is terrible.
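
For illustration, a minimal sketch against the fontbox 2.x API (the font
path and the choice of 'A' are placeholders):

import java.io.File;
import org.apache.fontbox.ttf.CmapSubtable;
import org.apache.fontbox.ttf.GlyphData;
import org.apache.fontbox.ttf.TTFParser;
import org.apache.fontbox.ttf.TrueTypeFont;

public class DumpGlyph {
    public static void main(String[] args) throws Exception {
        // parse the TTF and look up the glyph id for 'A' in the unicode cmap
        TrueTypeFont ttf = new TTFParser().parse(new File("/path/to/font.ttf"));
        CmapSubtable cmap = ttf.getUnicodeCmap();
        int gid = cmap.getGlyphId('A');
        GlyphData glyph = ttf.getGlyph().getGlyph(gid);
        if (glyph != null) {
            // the outline is a java.awt.geom.GeneralPath
            System.out.println(glyph.getPath().getBounds2D());
        }
        ttf.close();
    }
}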

For type1 fonts I would have to search...

To get the path of a glyph you hit in PDFBox - see the 
DrawPrintTextLocations.java example, search for "cyan". The methods are 
different for each font type.

With "good" PDFs you don't need to access fontbox directly, you could 
use the unicode given by the stripper together with the path and then 
build a table from that.
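
A rough sketch of that harvesting idea via the public PDFTextStripper
callback (the class name and output format are made up):

import java.io.File;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;

// collect (font name, character code) -> unicode pairs from a "good" PDF
public class CodeUnicodeHarvester extends PDFTextStripper {
    final Map<String, Map<Integer, String>> table = new HashMap<>();

    public CodeUnicodeHarvester() throws IOException {
    }

    @Override
    protected void processTextPosition(TextPosition text) {
        int[] codes = text.getCharacterCodes();
        if (codes != null && codes.length == 1) {
            table.computeIfAbsent(text.getFont().getName(), k -> new HashMap<>())
                 .putIfAbsent(codes[0], text.getUnicode());
        }
    }

    public static void main(String[] args) throws Exception {
        try (PDDocument doc = PDDocument.load(new File(args[0]))) {
            CodeUnicodeHarvester harvester = new CodeUnicodeHarvester();
            harvester.getText(doc); // drives the processTextPosition() callbacks
            System.out.println(harvester.table);
        }
    }
}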

But the question is: can/should we use the drawing path as a key, or is 
there something else that is unique and that would work with a subsetted 
font? Peter, what is your "key" to get the unicode?

Tilman


>
> 2) We can't rely on fonts having unique names across a corpus...right?
>   How would we pick from multiple options with the same name -- OOV%?
>
> 3) If one happened to have ~500k PDFs available, is there example code
> of how to pull out codepoint mappings/fonts with PDFBox?  This study
> might give some indication of feasibility of this approach across a
> heterogeneous corpus.
>
> Cheers,
>
>             Tim
>
> On Thu, Apr 4, 2019 at 7:58 AM Peter Murray-Rust <pm...@cam.ac.uk> wrote:
>> Completely agree with Tilman
>> I've made a large start with over 100 fonts (mainly from
>> science/tech/eng/math). See
>> https://github.com/petermr/cephis/blob/master/src/main/resources/org/contentmine/pdf2svg/codepoints/symbol/advpi1.xml
>> and many more in
>> https://github.com/petermr/cephis/blob/master/src/main/resources/org/contentmine/pdf2svg/codepoints/
>> <https://github.com/petermr/cephis/blob/master/src/main/resources/org/contentmine/pdf2svg/codepoints/symbol/advpi1.xml>
>>
>> Here's a typical one - from the Mathematical PI range:9 (apologies for
>> formatting)
>>
>> <!-- NOT UNICODE -->
>> <codePoint unicode="U+2264" decimal="35" name="numbersign" note="LESS-THAN
>> OR EQUAL TO"/>
>> <codePoint unicode="U+2265" decimal="36" name="dollar" note="GREATER-THAN
>> OR EQUAL TO"/>
>> <codePoint unicode="U+2245" decimal="38" name="ampersand" note="APPROXIMATELY
>> EQUAL TO"/>
>> <codePoint unicode="U+003C" decimal="44" name="comma" note="LESS-THAN SIGN"
>> />
>> Note how the codepoint and name have no relation to the glyph.
>>
>> Many of these fonts are proprietary and so impossible to obtain.
>>
>> I'd be happy to hear of others prepared to help with managing these - I've
>> spent months...
>>
>>
>>
>> On Wed, Apr 3, 2019 at 5:52 PM Tilman Hausherr <TH...@t-online.de>
>> wrote:
>>
>>> On 02.04.2019 at 03:59, Tim Allison wrote:
>>>> Again, short of AI, your best bet is to run OCR (tesseract) on these
>>> files.
>>>
>>>
>>> Another possible idea: create a huge database of font names, glyph
>>> paths (or a hash of them) and unicodes.
>>>
>>> One could create such a database by using "good" PDFs as source, or
>>> (more simply) by just getting the original fonts and going through them.
>>>
>>> The main problem might be that such a database is possibly huge or too
>>> slow. But it would bring better results than OCR.
>>>
>>> Tilman
>>>
>>>
>> --
>> Peter Murray-Rust
>> Reader Emeritus in Molecular Informatics
>> Unilever Centre, Dept. Of Chemistry
>> University of Cambridge
>> CB2 1EW, UK
>> +44-1223-763069




Re: No Unicode mapping for xx (xx) in font null

Posted by Tim Allison <ta...@apache.org>.
> One could create such a database by using "good" PDFs as source, or
> (more simply) by just getting the original fonts and going through them.

This has occurred to me, and I'm happy to hear that Peter has been
making headway on this option.

Some questions:
1) Where are the hooks in PDFBox to load an external font (or better,
if sufficient: the codepoint mappings) gathered via this method?
Would we need the full font, or could we inject only codepoint
mappings (some spaces are calculated by character width/distance from
last character, so we'd need the font info, right)?

2) We can't rely on fonts having unique names across a corpus...right?
 How would we pick from multiple options with the same name -- OOV%?

3) If one happened to have ~500k PDFs available, is there example code
of how to pull out codepoint mappings/fonts with PDFBox?  This study
might give some indication of feasibility of this approach across a
heterogeneous corpus.

Cheers,

           Tim

On Thu, Apr 4, 2019 at 7:58 AM Peter Murray-Rust <pm...@cam.ac.uk> wrote:
>
> Completely agree with Tilman
> I've made a large start with over 100 fonts (mainly from
> science/tech/eng/math). See
> https://github.com/petermr/cephis/blob/master/src/main/resources/org/contentmine/pdf2svg/codepoints/symbol/advpi1.xml
> and many more in
> https://github.com/petermr/cephis/blob/master/src/main/resources/org/contentmine/pdf2svg/codepoints/
> <https://github.com/petermr/cephis/blob/master/src/main/resources/org/contentmine/pdf2svg/codepoints/symbol/advpi1.xml>
>
> Here's a typical one - from the Mathematical PI range:9 (apologies for
> formatting)
>
> <!-- NOT UNICODE -->
> <codePoint unicode="U+2264" decimal="35" name="numbersign" note="LESS-THAN
> OR EQUAL TO"/>
> <codePoint unicode="U+2265" decimal="36" name="dollar" note="GREATER-THAN
> OR EQUAL TO"/>
> <codePoint unicode="U+2245" decimal="38" name="ampersand" note="APPROXIMATELY
> EQUAL TO"/>
> <codePoint unicode="U+003C" decimal="44" name="comma" note="LESS-THAN SIGN"
> />
> Note how the codepoint and name have no relation to the glyph.
>
> Many of these fonts are proprietary and so impossible to obtain.
>
> I'd be happy to hear of others prepared to help with managing these - I've
> spent months...
>
>
>
> On Wed, Apr 3, 2019 at 5:52 PM Tilman Hausherr <TH...@t-online.de>
> wrote:
>
> > On 02.04.2019 at 03:59, Tim Allison wrote:
> > > Again, short of AI, your best bet is to run OCR (tesseract) on these
> > files.
> >
> >
> > Another possible idea: create a huge database of font names, glyph
> > paths (or a hash of them) and unicodes.
> >
> > One could create such a database by using "good" PDFs as source, or
> > (more simply) by just getting the original fonts and going through them.
> >
> > The main problem might be that such a database is possibly huge or too
> > slow. But it would bring better results than OCR.
> >
> > Tilman
> >
> >
>
> --
> Peter Murray-Rust
> Reader Emeritus in Molecular Informatics
> Unilever Centre, Dept. Of Chemistry
> University of Cambridge
> CB2 1EW, UK
> +44-1223-763069



Re: No Unicode mapping for xx (xx) in font null

Posted by Peter Murray-Rust <pm...@cam.ac.uk>.
Completely agree with Tilman
I've made a large start with over 100 fonts (mainly from
science/tech/eng/math). See
https://github.com/petermr/cephis/blob/master/src/main/resources/org/contentmine/pdf2svg/codepoints/symbol/advpi1.xml
and many more in
https://github.com/petermr/cephis/blob/master/src/main/resources/org/contentmine/pdf2svg/codepoints/
<https://github.com/petermr/cephis/blob/master/src/main/resources/org/contentmine/pdf2svg/codepoints/symbol/advpi1.xml>

Here's a typical one - from the Mathematical PI range:9 (apologies for
formatting)

<!-- NOT UNICODE -->
<codePoint unicode="U+2264" decimal="35" name="numbersign" note="LESS-THAN
OR EQUAL TO"/>
<codePoint unicode="U+2265" decimal="36" name="dollar" note="GREATER-THAN
OR EQUAL TO"/>
<codePoint unicode="U+2245" decimal="38" name="ampersand" note="APPROXIMATELY
EQUAL TO"/>
<codePoint unicode="U+003C" decimal="44" name="comma" note="LESS-THAN SIGN"
/>
Note how the codepoint and name have no relation to the glyph.

Many of these fonts are proprietary and so impossible to obtain.

I'd be happy to hear of others prepared to help with managing these - I've
spent months...



On Wed, Apr 3, 2019 at 5:52 PM Tilman Hausherr <TH...@t-online.de>
wrote:

> On 02.04.2019 at 03:59, Tim Allison wrote:
> > Again, short of AI, your best bet is to run OCR (tesseract) on these
> files.
>
>
> Another possible idea: create a huge database of font names, glyph
> paths (or a hash of them) and unicodes.
>
> One could create such a database by using "good" PDFs as source, or
> (more simply) by just getting the original fonts and going through them.
>
> The main problem might be that such a database is possibly huge or too
> slow. But it would bring better results than OCR.
>
> Tilman
>
>

-- 
Peter Murray-Rust
Reader Emeritus in Molecular Informatics
Unilever Centre, Dept. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069

Re: No Unicode mapping for xx (xx) in font null

Posted by Tilman Hausherr <TH...@t-online.de>.
On 02.04.2019 at 03:59, Tim Allison wrote:
> Again, short of AI, your best bet is to run OCR (tesseract) on these files.


Another possible idea: create a huge database of font names, glyph
paths (or a hash of them) and unicodes.

One could create such a database by using "good" PDFs as source, or
(more simply) by just getting the original fonts and going through them.

The main problem might be that such a database is possibly huge or too 
slow. But it would bring better results than OCR.
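
For what it's worth, one way to key on the outline would be to hash the
segment types and quantized coordinates, so small numeric noise doesn't
change the key (the quantization factor below is a guess):

import java.awt.geom.GeneralPath;
import java.awt.geom.PathIterator;
import java.nio.ByteBuffer;
import java.security.MessageDigest;

// derive a stable key from a glyph outline for a
// (font name, glyph hash) -> unicode lookup table
public final class GlyphHash {
    public static String hash(GeneralPath path) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        double[] coords = new double[6];
        ByteBuffer buf = ByteBuffer.allocate(8);
        for (PathIterator it = path.getPathIterator(null); !it.isDone(); it.next()) {
            int type = it.currentSegment(coords);
            md.update((byte) type);
            int used = (type == PathIterator.SEG_CUBICTO) ? 6
                     : (type == PathIterator.SEG_QUADTO) ? 4
                     : (type == PathIterator.SEG_CLOSE) ? 0 : 2;
            for (int i = 0; i < used; i++) {
                buf.clear();
                buf.putLong(Math.round(coords[i] * 100)); // quantize the coordinate
                md.update(buf.array());
            }
        }
        StringBuilder sb = new StringBuilder();
        for (byte b : md.digest()) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }
}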

Tilman




Re: No Unicode mapping for xx (xx) in font null

Posted by Tim Allison <ta...@apache.org>.
> A hybrid solution would indeed be a much better approach in my case.

Yes. Tesseract is not optimized for speed.

On Mon, Apr 8, 2019 at 5:08 PM Giovanni De Stefano
<gi...@servisoft.be> wrote:
>
> Thanks a lot!
>
> I am now watching https://issues.apache.org/jira/browse/TIKA-2749.
>
> From the tests I made, performance dropped way too much when I set it to always OCR PDFs as images instead of just extracting the text :-(
>
> A hybrid solution would indeed be a much better approach in my case.
>
>
> Giovanni
> On 5 Apr 2019, 20:12 +0200, Tim Allison <ta...@apache.org>, wrote:
>
> Also, does anybody know when 1.21 is due? :-)
>
>
> Both POI and PDFBox are about to make releases. I'd be willing to run
> a release of Tika once those are out (two or so weeks)...
>
> Fellow devs, What do you think of 1.21 shortly after POI and PDFBox
> are released?
>
> Do you think that would be a decent strategy?
>
> Yep, exactly. I _may_ have time to implement a "first steps" of
> https://issues.apache.org/jira/browse/TIKA-2749 before the
> release...so maybe you won't have to make changes on your side.
>
> On Thu, Apr 4, 2019 at 5:06 PM Giovanni De Stefano
> <gi...@servisoft.be> wrote:
>
>
> I could use the number of unmapped unicode chars in a page to decide whether a PDF should be parsed “normally” or OCRed.
>
> Do you think that would be a decent strategy?
>
> Also, does anybody know when 1.21 is due? :-)
>
>
> Giovanni
> On 4 Apr 2019, 13:06 +0200, Tim Allison <ta...@apache.org>, wrote:
>
> And with TIKA-2846 (thanks to Tilman), you will now be able to see how
> many unmapped chars there were per page. If there's more than one
> page, you'll get a parallel array of ints. These were the results on
> your doc:
>
> 0: pdf:unmappedUnicodeCharsPerPage : 3242
> 0: pdf:charsPerPage : 3242
>
> Note, you'll either have to retrieve the Tika Metadata object after
> the parse or use the RecursiveParserWrapper (-j /rmeta). These stats
> won't show up in the xhtml because they are calculated after the first
> bit of content has been written.
>
> On Tue, Apr 2, 2019 at 4:52 AM Giovanni De Stefano (zxxz)
> <gi...@servisoft.be> wrote:
>
>
> Hello Tim, Peter,
>
> Thank you for your replies.
>
> It seems indeed that the only solution is to include Tesseract in my processing pipeline.
>
> I don’t know if it might be useful to future readers, but I noticed that *all* PDFs created with PDF24 are subject to this behavior.
>
> I guess this might fall into the “obfuscation” approach some software adopt :-(
>
> Cheers,
>
> Giovanni
> On 2 Apr 2019, 04:48 +0200, Peter Murray-Rust <pm...@cam.ac.uk>, wrote:
>
> I agree with Tim's analysis.
>
> Many "legacy" fonts (including unfortunately some of those used by LaTeX)
> are not mapped onto Unicode. There are two indications (codepoints and
> names) which can often be used to create a partial mapping. I spent a *lot*
> of time doing this manually. For example
>
>
> WARN No Unicode mapping for .notdef (89) in font null
>
> WARN No Unicode mapping for 90 (90) in font null
> <<<
> The first field is the name, the second the codepoint. In your example the
> font (probably) uses codepoints consistently within that particular font,
> e.g. 89 is consistently the same character and different from 90. The names
> *may* differentiate characters. Here is my (hand-edited) entry for CMSY
> (used by LaTeX for symbols):
>
> <codePoint unicode="U+00B1" name=".notdef" note="PLUS-MINUS SIGN"/>
>
> But this will only work for this particular font.
>
> If you are only dealing with anglophone alphanumeric text from a single
> source/font you can probably work out a table. You are welcome to use mine
> (mainly from scientific / technical publishing). Beyond that, OCR/Tesseract
> may help (I use it a lot). However, maths and non-ISO-LATIN are problematic.
> For example, distinguishing between the many types of dash/minus/underline
> depends on having a system trained on these. Relative heights and sizes are a
> major problem.
>
> In general, typesetters and their software are only concerned with the
> visual display and frequently use illiteracies (e.g. "=" + backspace + "/"
> for "not-equals"). Anyone having work typeset in PDF should insist that a
> Unicode font is used. Better still, avoid PDF.
>
>
>
> --
> Peter Murray-Rust
> Reader Emeritus in Molecular Informatics
> Unilever Centre, Dept. Of Chemistry
> University of Cambridge
> CB2 1EW, UK
> +44-1223-763069

Re: No Unicode mapping for xx (xx) in font null

Posted by Giovanni De Stefano <gi...@servisoft.be>.
Thanks a lot!

I am now watching https://issues.apache.org/jira/browse/TIKA-2749.

From the tests I made, performance dropped way too much when I set it to always OCR PDFs as images instead of just extracting the text :-(

A hybrid solution would indeed be a much better approach in my case.


Giovanni
On 5 Apr 2019, 20:12 +0200, Tim Allison <ta...@apache.org>, wrote:
> > Also, does anybody know when 1.21 is due? :-)
>
> Both POI and PDFBox are about to make releases. I'd be willing to run
> a release of Tika once those are out (two or so weeks)...
>
> Fellow devs, What do you think of 1.21 shortly after POI and PDFBox
> are released?
>
> > Do you think that would be a decent strategy?
> Yep, exactly. I _may_ have time to implement a "first steps" of
> https://issues.apache.org/jira/browse/TIKA-2749 before the
> release...so maybe you won't have to make changes on your side.
>
> On Thu, Apr 4, 2019 at 5:06 PM Giovanni De Stefano
> <gi...@servisoft.be> wrote:
> >
> > I could use the number of unmapped unicode chars in a page to decide whether a PDF should be parsed “normally” or OCRed.
> >
> > Do you think that would be a decent strategy?
> >
> > Also, does anybody know when 1.21 is due? :-)
> >
> >
> > Giovanni
> > On 4 Apr 2019, 13:06 +0200, Tim Allison <ta...@apache.org>, wrote:
> >
> > And with TIKA-2846 (thanks to Tilman), you will now be able to see how
> > many unmapped chars there were per page. If there's more than one
> > page, you'll get a parallel array of ints. These were the results on
> > your doc:
> >
> > 0: pdf:unmappedUnicodeCharsPerPage : 3242
> > 0: pdf:charsPerPage : 3242
> >
> > Note, you'll either have to retrieve the Tika Metadata object after
> > the parse or use the RecursiveParserWrapper (-j /rmeta). These stats
> > won't show up in the xhtml because they are calculated after the first
> > bit of content has been written.
> >
> > On Tue, Apr 2, 2019 at 4:52 AM Giovanni De Stefano (zxxz)
> > <gi...@servisoft.be> wrote:
> >
> >
> > Hello Tim, Peter,
> >
> > Thank you for your replies.
> >
> > It seems indeed that the only solution is to include Tesseract in my processing pipeline.
> >
> > I don’t know if it might be useful to future readers, but I noticed that *all* PDFs created with PDF24 are subject to this behavior.
> >
> > I guess this might fall into the “obfuscation” approach some software adopt :-(
> >
> > Cheers,
> >
> > Giovanni
> > On 2 Apr 2019, 04:48 +0200, Peter Murray-Rust <pm...@cam.ac.uk>, wrote:
> >
> > I agree with Tim's analysis.
> >
> > Many "legacy" fonts (including unfortunately some of those used by LaTeX)
> > are not mapped onto Unicode. There are two indications (codepoints and
> > names) which can often be used to create a partial mapping. I spent a *lot*
> > of time doing this manually. For example
> >
> >
> > WARN No Unicode mapping for .notdef (89) in font null
> >
> > WARN No Unicode mapping for 90 (90) in font null
> > <<<
> > The first field is the name, the second the codepoint. In your example the
> > font (probably) uses codepoints consistently within that particular font,
> > e.g. 89 is consistently the same character and different from 90. The names
> > *may* differentiate characters. Here is my (hand-edited) entry for CMSY
> > (used by LaTeX for symbols):
> >
> > <codePoint unicode="U+00B1" name=".notdef" note="PLUS-MINUS SIGN"/>
> >
> > But this will only work for this particular font.
> >
> > If you are only dealing with anglophone alphanumeric text from a single
> > source/font you can probably work out a table. You are welcome to use mine
> > (mainly from scientific / technical publishing). Beyond that, OCR/Tesseract
> > may help (I use it a lot). However, maths and non-ISO-LATIN are problematic.
> > For example, distinguishing between the many types of dash/minus/underline
> > depends on having a system trained on these. Relative heights and sizes are a
> > major problem.
> >
> > In general, typesetters and their software are only concerned with the
> > visual display and frequently use illiteracies (e.g. "=" + backspace + "/"
> > for "not-equals"). Anyone having work typeset in PDF should insist that a
> > Unicode font is used. Better still, avoid PDF.
> >
> >
> >
> > --
> > Peter Murray-Rust
> > Reader Emeritus in Molecular Informatics
> > Unilever Centre, Dept. Of Chemistry
> > University of Cambridge
> > CB2 1EW, UK
> > +44-1223-763069

Re: No Unicode mapping for xx (xx) in font null

Posted by Tim Allison <ta...@apache.org>.
>Also, does anybody know when 1.21 is due? :-)

Both POI and PDFBox are about to make releases.  I'd be willing to run
a release of Tika once those are out (two or so weeks)...

Fellow devs,  What do you think of 1.21 shortly after POI and PDFBox
are released?

>Do you think that would be a decent strategy?
Yep, exactly.  I _may_ have time to implement a "first steps" of
https://issues.apache.org/jira/browse/TIKA-2749 before the
release...so maybe you won't have to make changes on your side.

On Thu, Apr 4, 2019 at 5:06 PM Giovanni De Stefano
<gi...@servisoft.be> wrote:
>
> I could use the number of unmapped unicode chars in a page to decide whether a PDF should be parsed “normally” or OCRed.
>
> Do you think that would be a decent strategy?
>
> Also, does anybody know when 1.21 is due? :-)
>
>
> Giovanni
> On 4 Apr 2019, 13:06 +0200, Tim Allison <ta...@apache.org>, wrote:
>
> And with TIKA-2846 (thanks to Tilman), you will now be able to see how
> many unmapped chars there were per page. If there's more than one
> page, you'll get a parallel array of ints. These were the results on
> your doc:
>
> 0: pdf:unmappedUnicodeCharsPerPage : 3242
> 0: pdf:charsPerPage : 3242
>
> Note, you'll either have to retrieve the Tika Metadata object after
> the parse or use the RecursiveParserWrapper (-j /rmeta). These stats
> won't show up in the xhtml because they are calculated after the first
> bit of content has been written.
>
> On Tue, Apr 2, 2019 at 4:52 AM Giovanni De Stefano (zxxz)
> <gi...@servisoft.be> wrote:
>
>
> Hello Tim, Peter,
>
> Thank you for your replies.
>
> It seems indeed that the only solution is to include Tesseract in my processing pipeline.
>
> I don’t know if it might be useful to future readers, but I noticed that *all* PDFs created with PDF24 are subject to this behavior.
>
> I guess this might fall into the “obfuscation” approach some software adopt :-(
>
> Cheers,
>
> Giovanni
> On 2 Apr 2019, 04:48 +0200, Peter Murray-Rust <pm...@cam.ac.uk>, wrote:
>
> I agree with Tim's analysis.
>
> Many "legacy" fonts (including unfortunately some of those used by LaTeX)
> are not mapped onto Unicode. There are two indications (codepoints and
> names) which can often be used to create a partial mapping. I spent a *lot*
> of time doing this manually. For example
>
>
> WARN No Unicode mapping for .notdef (89) in font null
>
> WARN No Unicode mapping for 90 (90) in font null
> <<<
> The first field is the name, the second the codepoint. In your example the
> font (probably) uses codepoints consistently within that particular font,
> e.g. 89 is consistently the same character and different from 90. The names
> *may* differentiate characters. Here is my (hand-edited) entry for CMSY
> (used by LaTeX for symbols):
>
> <codePoint unicode="U+00B1" name=".notdef" note="PLUS-MINUS SIGN"/>
>
> But this will only work for this particular font.
>
> If you are only dealing with anglophone alphanumeric text from a single
> source/font you can probably work out a table. You are welcome to use mine
> (mainly from scientific / technical publishing). Beyond that, OCR/Tesseract
> may help (I use it a lot). However, maths and non-ISO-LATIN are problematic.
> For example, distinguishing between the many types of dash/minus/underline
> depends on having a system trained on these. Relative heights and sizes are a
> major problem.
>
> In general, typesetters and their software are only concerned with the
> visual display and frequently use illiteracies (e.g. "=" + backspace + "/"
> for "not-equals"). Anyone having work typeset in PDF should insist that a
> Unicode font is used. Better still, avoid PDF.
>
>
>
> --
> Peter Murray-Rust
> Reader Emeritus in Molecular Informatics
> Unilever Centre, Dept. Of Chemistry
> University of Cambridge
> CB2 1EW, UK
> +44-1223-763069

Re: No Unicode mapping for xx (xx) in font null

Posted by Giovanni De Stefano <gi...@servisoft.be>.
I could use the number of unmapped unicode chars in a page to decide whether a PDF should be parsed “normally” or OCRed.
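
Something like this is what I have in mind, assuming the TIKA-2846
per-page counts are in the Metadata of a first, text-only parse (the 10%
threshold is just a placeholder I would still have to tune):

import org.apache.tika.metadata.Metadata;

public class OcrDecider {
    // true if more than 10% of the extracted chars had no unicode mapping
    public static boolean shouldOcr(Metadata metadata) {
        String[] unmapped = metadata.getValues("pdf:unmappedUnicodeCharsPerPage");
        String[] total = metadata.getValues("pdf:charsPerPage");
        long u = 0, t = 0;
        for (String s : unmapped) u += Long.parseLong(s);
        for (String s : total) t += Long.parseLong(s);
        return t > 0 && u / (double) t > 0.10;
    }
}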

Do you think that would be a decent strategy?

Also, does anybody know when 1.21 is due? :-)


Giovanni
On 4 Apr 2019, 13:06 +0200, Tim Allison <ta...@apache.org>, wrote:
> And with TIKA-2846 (thanks to Tilman), you will now be able to see how
> many unmapped chars there were per page. If there's more than one
> page, you'll get a parallel array of ints. These were the results on
> your doc:
>
> 0: pdf:unmappedUnicodeCharsPerPage : 3242
> 0: pdf:charsPerPage : 3242
>
> Note, you'll either have to retrieve the Tika Metadata object after
> the parse or use the RecursiveParserWrapper (-j /rmeta). These stats
> won't show up in the xhtml because they are calculated after the first
> bit of content has been written.
>
> On Tue, Apr 2, 2019 at 4:52 AM Giovanni De Stefano (zxxz)
> <gi...@servisoft.be> wrote:
> >
> > Hello Tim, Peter,
> >
> > Thank you for your replies.
> >
> > It seems indeed that the only solution is to include Tesseract in my processing pipeline.
> >
> > I don’t know if it might be useful to future readers, but I noticed that *all* PDFs created with PDF24 are subject to this behavior.
> >
> > I guess this might fall into the “obfuscation” approach some software adopt :-(
> >
> > Cheers,
> >
> > Giovanni
> > On 2 Apr 2019, 04:48 +0200, Peter Murray-Rust <pm...@cam.ac.uk>, wrote:
> >
> > I agree with Tim's analysis.
> >
> > Many "legacy" fonts (including unfortunately some of those used by LaTeX)
> > are not mapped onto Unicode. There are two indications (codepoints and
> > names) which can often be used to create a partial mapping. I spent a *lot*
> > of time doing this manually. For example
> >
> >
> > WARN No Unicode mapping for .notdef (89) in font null
> >
> > WARN No Unicode mapping for 90 (90) in font null
> > <<<
> > The first field is the name, the second the codepoint. In your example the
> > font (probably) uses codepoints consistently within that particular font,
> > e.g. 89 is consistently the same character and different from 90. The names
> > *may* differentiate characters. Here is my (hand-edited) entry for CMSY
> > (used by LaTeX for symbols):
> >
> > <codePoint unicode="U+00B1" name=".notdef" note="PLUS-MINUS SIGN"/>
> >
> > But this will only work for this particular font.
> >
> > If you are only dealing with anglophone alphanumeric text from a single
> > source/font you can probably work out a table. You are welcome to use mine
> > (mainly from scientific / technical publishing). Beyond that, OCR/Tesseract
> > may help (I use it a lot). However, maths and non-ISO-LATIN are problematic.
> > For example, distinguishing between the many types of dash/minus/underline
> > depends on having a system trained on these. Relative heights and sizes are a
> > major problem.
> >
> > In general, typesetters and their software are only concerned with the
> > visual display and frequently use illiteracies (e.g. "=" + backspace + "/"
> > for "not-equals"). Anyone having work typeset in PDF should insist that a
> > Unicode font is used. Better still, avoid PDF.
> >
> >
> >
> > --
> > Peter Murray-Rust
> > Reader Emeritus in Molecular Informatics
> > Unilever Centre, Dept. Of Chemistry
> > University of Cambridge
> > CB2 1EW, UK
> > +44-1223-763069


Re: No Unicode mapping for xx (xx) in font null

Posted by Tim Allison <ta...@apache.org>.
a parallel array -> parallel arrays

-j -> -J (tika-app commandline options)

On Thu, Apr 4, 2019 at 7:06 AM Tim Allison <ta...@apache.org> wrote:
>
> And with TIKA-2846 (thanks to Tilman), you will now be able to see how
> many unmapped chars there were per page.  If there's more than one
> page, you'll get a parallel array of ints.  These were the results on
> your doc:
>
> 0: pdf:unmappedUnicodeCharsPerPage : 3242
> 0: pdf:charsPerPage : 3242
>
> Note, you'll either have to retrieve the Tika Metadata object after
> the parse or use the RecursiveParserWrapper (-j /rmeta).  These stats
> won't show up in the xhtml because they are calculated after the first
> bit of content has been written.
>
> On Tue, Apr 2, 2019 at 4:52 AM Giovanni De Stefano (zxxz)
> <gi...@servisoft.be> wrote:
> >
> > Hello Tim, Peter,
> >
> > Thank you for your replies.
> >
> > It seems indeed that the only solution is to include Tesseract in my processing pipeline.
> >
> > I don’t know if it might be useful to future readers, but I noticed that *all* PDFs created with PDF24 are subject to this behavior.
> >
> > I guess this might fall into the “obfuscation” approach some software adopt :-(
> >
> > Cheers,
> >
> > Giovanni
> > On 2 Apr 2019, 04:48 +0200, Peter Murray-Rust <pm...@cam.ac.uk>, wrote:
> >
> > I agree with Tim's analysis.
> >
> > Many "legacy" fonts (including unfortunately some of those used by LaTeX)
> > are not mapped onto Unicode. There are two indications (codepoints and
> > names) which can often be used to create a partial mapping. I spent a *lot*
> > of time doing this manually. For example
> >
> >
> > WARN No Unicode mapping for .notdef (89) in font null
> >
> > WARN No Unicode mapping for 90 (90) in font null
> > <<<
> > The first field is the name, the second the codepoint. In your example the
> > font (probably) uses codepoints consistently within that particular font,
> > e.g. 89 is consistently the same character and different from 90. The names
> > *may* differentiate characters. Here is my (hand-edited) entry for CMSY
> > (used by LaTeX for symbols):
> >
> > <codePoint unicode="U+00B1" name=".notdef" note="PLUS-MINUS SIGN"/>
> >
> > But this will only work for this particular font.
> >
> > If you are only dealing with anglophone alphanumeric text from a single
> > source/font you can probably work out a table. You are welcome to use mine
> > (mainly from scientific / technical publishing). Beyond that, OCR/Tesseract
> > may help (I use it a lot). However, maths and non-ISO-LATIN are problematic.
> > For example, distinguishing between the many types of dash/minus/underline
> > depends on having a system trained on these. Relative heights and sizes are a
> > major problem.
> >
> > In general, typesetters and their software are only concerned with the
> > visual display and frequently use illiteracies (e.g. "=" + backspace + "/"
> > for "not-equals"). Anyone having work typeset in PDF should insist that a
> > Unicode font is used. Better still, avoid PDF.
> >
> >
> >
> > --
> > Peter Murray-Rust
> > Reader Emeritus in Molecular Informatics
> > Unilever Centre, Dept. Of Chemistry
> > University of Cambridge
> > CB2 1EW, UK
> > +44-1223-763069



Re: No Unicode mapping for xx (xx) in font null

Posted by Tim Allison <ta...@apache.org>.
And with TIKA-2846 (thanks to Tilman), you will now be able to see how
many unmapped chars there were per page.  If there's more than one
page, you'll get a parallel array of ints.  These were the results on
your doc:

0: pdf:unmappedUnicodeCharsPerPage : 3242
0: pdf:charsPerPage : 3242

Note, you'll either have to retrieve the Tika Metadata object after
the parse or use the RecursiveParserWrapper (-j /rmeta).  These stats
won't show up in the xhtml because they are calculated after the first
bit of content has been written.
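
For example, a bare-bones sketch of pulling those values out
programmatically (requires a Tika build that includes TIKA-2846):

import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class UnmappedCharsPerPage {
    public static void main(String[] args) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        Metadata metadata = new Metadata();
        try (InputStream is = new FileInputStream(args[0])) {
            // -1 disables the handler's default write limit
            parser.parse(is, new BodyContentHandler(-1), metadata);
        }
        String[] unmapped = metadata.getValues("pdf:unmappedUnicodeCharsPerPage");
        String[] total = metadata.getValues("pdf:charsPerPage");
        for (int i = 0; i < unmapped.length; i++) {
            System.out.println("page " + i + ": " + unmapped[i] + " of " + total[i]);
        }
    }
}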

On Tue, Apr 2, 2019 at 4:52 AM Giovanni De Stefano (zxxz)
<gi...@servisoft.be> wrote:
>
> Hello Tim, Peter,
>
> Thank you for your replies.
>
> It seems indeed that the only solution is to include Tesseract in my processing pipeline.
>
> I don’t know if it might be useful to future readers, but I noticed that *all* PDFs created with PDF24 are subject to this behavior.
>
> I guess this might fall into the “obfuscation” approach some software adopt :-(
>
> Cheers,
>
> Giovanni
> On 2 Apr 2019, 04:48 +0200, Peter Murray-Rust <pm...@cam.ac.uk>, wrote:
>
> I agree with Tim's analysis.
>
> Many "legacy" fonts (including unfortunately some of those used by LaTeX)
> are not mapped onto Unicode. There are two indications (codepoints and
> names) which can often be used to create a partial mapping. I spent a *lot*
> of time doing this manually. For example
>
>
> WARN No Unicode mapping for .notdef (89) in font null
>
> WARN No Unicode mapping for 90 (90) in font null
> <<<
> The first field is the name, the second the codepoint. In your example the
> font (probably) uses codepoints consistently within that particular font,
> e.g. 89 is consistently the same character and different from 90. The names
> *may* differentiate characters. Here is my (hand-edited) entry for CMSY
> (used by LaTeX for symbols):
>
> <codePoint unicode="U+00B1" name=".notdef" note="PLUS-MINUS SIGN"/>
>
> But this will only work for this particular font.
>
> If you are only dealing with anglophone alphanumeric text from a single
> source/font you can probably work out a table. You are welcome to use mine
> (mainly from scientific / technical publishing). Beyond that, OCR/Tesseract
> may help (I use it a lot). However, maths and non-ISO-LATIN are problematic.
> For example, distinguishing between the many types of dash/minus/underline
> depends on having a system trained on these. Relative heights and sizes are a
> major problem.
>
> In general, typesetters and their software are only concerned with the
> visual display and frequently use illiteracies (e.g. "=" + backspace + "/"
> for "not-equals"). Anyone having work typeset in PDF should insist that a
> Unicode font is used. Better still, avoid PDF.
>
>
>
> --
> Peter Murray-Rust
> Reader Emeritus in Molecular Informatics
> Unilever Centre, Dept. Of Chemistry
> University of Cambridge
> CB2 1EW, UK
> +44-1223-763069



Signs of corrupt text during the parse -- Was: No Unicode mapping for xx (xx) in font null

Posted by Tim Allison <ta...@apache.org>.
Speaking of this, any recommendations on using information from the
per-page parse to figure out if text might be corrupt...without
wrecking PDFBox's API?

https://issues.apache.org/jira/browse/TIKA-2749?focusedCommentId=16807661&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16807661


---------- Forwarded message ---------
From: Giovanni De Stefano (zxxz) <gi...@servisoft.be>
Date: Tue, Apr 2, 2019 at 4:52 AM
Subject: Re: No Unicode mapping for xx (xx) in font null
To: <us...@pdfbox.apache.org>
Cc: <us...@tika.apache.org>


Hello Tim, Peter,

Thank you for your replies.

It seems indeed that the only solution is to include Tesseract in my
processing pipeline.

I don’t know if it might be useful to future readers, but I noticed
that *all* PDFs created with PDF24 are subject to this behavior.

I guess this might fall into the “obfuscation” approach some software adopt :-(

Cheers,

Giovanni
On 2 Apr 2019, 04:48 +0200, Peter Murray-Rust <pm...@cam.ac.uk>, wrote:

I agree with Tim's analysis.

Many "legacy" fonts (including unfortunately some of those used by LaTeX)
are not mapped onto Unicode. There are two indications (codepoints and
names) which can often be used to create a partial mapping. I spent a *lot*
of time doing this manually. For example


WARN No Unicode mapping for .notdef (89) in font null

WARN No Unicode mapping for 90 (90) in font null
<<<
The first field is the name, the second the codepoint. In your example the
font (probably) uses codepoints consistently within that particular font,
e.g. 89 is consistently the same character and different from 90. The names
*may* differentiate characters. Here is my (hand-edited) entry for CMSY
(used by LaTeX for symbols):

<codePoint unicode="U+00B1" name=".notdef" note="PLUS-MINUS SIGN"/>

But this will only work for this particular font.

If you are only dealing with anglophone alphanumeric text from a single
source/font you can probably work out a table. You are welcome to use mine
(mainly from scientific / technical publishing). Beyond that, OCR/Tesseract
may help (I use it a lot). However, maths and non-ISO-LATIN are problematic.
For example, distinguishing between the many types of dash/minus/underline
depends on having a system trained on these. Relative heights and sizes are a
major problem.

In general, typesetters and their software are only concerned with the
visual display and frequently use illiteracies (e.g. "=" + backspace + "/"
for "not-equals"). Anyone having work typeset in PDF should insist that a
Unicode font is used. Better still, avoid PDF.



--
Peter Murray-Rust
Reader Emeritus in Molecular Informatics
Unilever Centre, Dept. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069



Re: No Unicode mapping for xx (xx) in font null

Posted by "Giovanni De Stefano (zxxz)" <gi...@servisoft.be>.
Hello Tim, Peter,

Thank you for your replies.

It seems indeed that the only solution is to include Tesseract in my processing pipeline.

I don’t know if it might be useful to future readers, but I noticed that *all* PDFs created with PDF24 are subject to this behavior.

I guess this might fall into the “obfuscation” approach some software adopt :-(

Cheers,

Giovanni
On 2 Apr 2019, 04:48 +0200, Peter Murray-Rust <pm...@cam.ac.uk>, wrote:
> I agree with Tim's analysis.
>
> Many "legacy" fonts (including unfortunately some of those used by LaTeX)
> are not mapped onto Unicode. There are two indications (codepoints and
> names) which can often be used to create a partial mapping. I spent a *lot*
> of time doing this manually. For example
> >>>
> WARN No Unicode mapping for .notdef (89) in font null
>
> WARN No Unicode mapping for 90 (90) in font null
> <<<
> The first field is the name, the second the codepoint. In your example the
> font (probably) uses codepoints consistently within that particular font,
> e.g. 89 is consistently the same character and different from 90. The names
> *may* differentiate characters. Here is my (hand-edited) entry for CMSY
> (used by LaTeX for symbols):
>
> <codePoint unicode="U+00B1" name=".notdef" note="PLUS-MINUS SIGN"/>
>
> But this will only work for this particular font.
>
> If you are only dealing with anglophone alphanumeric text from a single
> source/font you can probably work out a table. You are welcome to use mine
> (mainly from scientific / technical publishing). Beyond that, OCR/Tesseract
> may help (I use it a lot). However, maths and non-ISO-LATIN are problematic.
> For example, distinguishing between the many types of dash/minus/underline
> depends on having a system trained on these. Relative heights and sizes are a
> major problem.
>
> In general, typesetters and their software are only concerned with the
> visual display and frequently use illiteracies (e.g. "=" + backspace + "/"
> for "not-equals"). Anyone having work typeset in PDF should insist that a
> Unicode font is used. Better still, avoid PDF.
>
>
>
> --
> Peter Murray-Rust
> Reader Emeritus in Molecular Informatics
> Unilever Centre, Dept. Of Chemistry
> University of Cambridge
> CB2 1EW, UK
> +44-1223-763069


Re: No Unicode mapping for xx (xx) in font null

Posted by Peter Murray-Rust <pm...@cam.ac.uk>.
I agree with Tim's analysis.

Many "legacy" fonts (including unfortunately some of those used by LaTeX)
are not mapped onto Unicode. There are two indications (codepoints and
names) which can often be used to create a partial mapping. I spent a *lot*
of time doing this manually. For example
>>>
WARN  No Unicode mapping for .notdef (89) in font null

WARN  No Unicode mapping for 90 (90) in font null
<<<
The first field is the name, the second the codepoint. In your example the
font (probably) uses codepoints consistently within that particular font,
e.g. 89 is consistently the same character and different from 90. The names
*may* differentiate characters. Here is my (hand-edited) entry for CMSY
(used by LaTeX for symbols):

<codePoint unicode="U+00B1" name=".notdef" note="PLUS-MINUS SIGN"/>

But this will only work for this particular font.

If you are only dealing with anglophone alphanumeric text from a single
source/font you can probably work out a table. You are welcome to use mine
(mainly from scientific / technical publishing). Beyond that, OCR/Tesseract
may help (I use it a lot). However, maths and non-ISO-LATIN are problematic.
For example, distinguishing between the many types of dash/minus/underline
depends on having a system trained on these. Relative heights and sizes are a
major problem.

In general, typesetters and their software are only concerned with the
visual display and frequently use illiteracies (e.g. "=" + backspace + "/"
for "not-equals"). Anyone having work typeset in PDF should insist that a
Unicode font is used. Better still, avoid PDF.
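
If anyone wants to apply such a table, here is a minimal sketch of loading
entries in the format above with the JDK's DOM parser (the decimal and
note attributes are ignored for brevity):

import java.io.File;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class CodePointTable {
    // read <codePoint unicode="U+XXXX" name="..."/> entries into a
    // glyph-name -> unicode-string map
    public static Map<String, String> load(File xml) throws Exception {
        Map<String, String> byName = new HashMap<>();
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(xml);
        NodeList points = doc.getElementsByTagName("codePoint");
        for (int i = 0; i < points.getLength(); i++) {
            Element e = (Element) points.item(i);
            int cp = Integer.parseInt(e.getAttribute("unicode").substring(2), 16);
            byName.put(e.getAttribute("name"), new String(Character.toChars(cp)));
        }
        return byName;
    }
}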



-- 
Peter Murray-Rust
Reader Emeritus in Molecular Informatics
Unilever Centre, Dept. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069

Re: No Unicode mapping for xx (xx) in font null

Posted by Tim Allison <ta...@apache.org>.
I defer to my colleagues on PDFBox, but the unicode mapping warning
means what it says -- there is no way (short of nlp/language
modeling/ai) to reconstruct how to map the characters as stored in the
document to the correct unicode equivalents.  The electronic text
stored within the PDF may or may not reflect the presentation layer,
and with no unicode mapping in the attached...it doesn't.

If you "save as text" the attached file with Adobe Reader, you also get garbage:
FGHIJKLIMHNNOPQMRSMNQTLIPMHMQPQMNLUVWHJQMXYZ[\Y]Y^[_Y'aYbacdYedY'fa__ghiYedYjkljmYnfiha\hY'dYo\p\ch_dYedcYq\papndcYr[\Yiha\hYnsotihdphYts[_
_dodhh_dY'dcYaodpedcYdhYann_

Again, short of AI, your best bet is to run OCR (tesseract) on these files.
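
If you run the OCR through Tika, a minimal sketch (this assumes Tesseract
is installed and on the PATH, and forces OCR for every page):

import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;

public class OcrPdf {
    public static void main(String[] args) throws Exception {
        PDFParserConfig pdfConfig = new PDFParserConfig();
        pdfConfig.setOcrStrategy(PDFParserConfig.OCR_STRATEGY.OCR_ONLY);

        AutoDetectParser parser = new AutoDetectParser();
        ParseContext context = new ParseContext();
        context.set(PDFParserConfig.class, pdfConfig);
        context.set(Parser.class, parser); // needed so the rendered pages get OCRed

        BodyContentHandler handler = new BodyContentHandler(-1);
        Metadata metadata = new Metadata();
        try (InputStream is = new FileInputStream(args[0])) {
            parser.parse(is, handler, metadata, context);
        }
        System.out.println(handler.toString());
    }
}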

Somewhere on my plate is to integrate tika-eval _into_ the PDFParser
to determine when mojibake is being extracted and run OCR
(TIKA-2749?)...that's likely several months off.

Sorry I can't help...

On Mon, Apr 1, 2019 at 5:26 PM Giovanni De Stefano
<gi...@servisoft.be> wrote:
>
> Hello,
>
>
>
> I am having trouble extracting data from a bunch of PDFs.
>
>
>
> The output I get is something like:
>
>
>
> cd\pYe[Ŷd_z\ndYedYnspn\̀\ah\spYv\cnàdY ỲaY€d̀̀[̀dYedcYcapnh\spcYaeo\p\ch_ah\zdcY ̂€‚ƒmYr[\YaY_dt_\cYndcYnsotihdpndcw
>
> „S……KMV†SNMWL…MHULIRL…M‡ˆ„MXY‰spY]YŠdY€̂‚YpfdchYnsotihdphYr[fdpYoah\‹_dYedY_do\cdYdhŒs[Y_ie[nh\spYedcYann_s\ccdodphcYef\otuhYdhYaodpedc
>
> v\cnàdcYaeo\p\ch_ah\zdcYdpYoah\‹_dYef\otuhYc[_ỲdcY_dzdp[cmYedYha dcYacc\o\̀idcYa[ Y\otuhcYc[_ỲdcY_dzdp[cYdhYedYe_s\hcYdhYha dcYe\zd_cdcwYŠdc
>
> _do\cdcYdhŒs[Y_ie[nh\spcYefaodpedcYŽ ‚YpdYcsphYespnYtacYz\cidcwYŠdYo\p\ch_dYedcYq\papndcYs[YcspYvspnh\sppa\_dYeìi [iY_dchdphYespnYnsotihdphc
>
>  a_hwY}Ya__ghiYe[Y_i dphmYjkw|lwjkljƒYw
>
>
>
> The logs inform me that many Unicode mappings are missing:
>
>
>
> WARN  No Unicode mapping for 87 (87) in font null
>
> WARN  No Unicode mapping for 88 (88) in font null
>
> WARN  No Unicode mapping for .notdef (89) in font null
>
> WARN  No Unicode mapping for 90 (90) in font null
>
> WARN  No Unicode mapping for 91 (91) in font null
>
> WARN  No Unicode mapping for 92 (92) in font null
>
>
>
> I can reproduce this behavior with a vanilla Tika Server 1.20.
>
>
>
> I attach the pdf here.
>
>
>
> What could be wrong? Any idea on the steps I can take to properly extract metadata and body?
>
>
>
> Thanks a lot,
>
> Giovanni


