Posted to users@pdfbox.apache.org by Susan Borda <sb...@umich.edu> on 2023/06/26 17:36:42 UTC

Does preflight check for "character encoding"?

Hi All-
I'd like to check PDFs for character encoding issues. Does Preflight do
that? I checked the accessibility of a PDF file in Adobe Acrobat Pro and it
gave me a "Character encoding -Failed" message. When I checked this same
file in Preflight I got this:

Jun 26, 2023 1:24:41 PM org.apache.pdfbox.pdmodel.graphics.color.PDICCBased ensureDisplayProfile
WARNING: ICC profile is Perceptual, ignoring, treating as Display class
Jun 26, 2023 1:24:41 PM org.apache.pdfbox.pdmodel.graphics.color.PDICCBased ensureDisplayProfile
WARNING: ICC profile is Perceptual, ignoring, treating as Display class
Jun 26, 2023 1:24:41 PM org.apache.pdfbox.pdmodel.graphics.color.PDICCBased ensureDisplayProfile
WARNING: ICC profile is Perceptual, ignoring, treating as Display class
The file BritishLibrary-PDF_Assessment_v1.3.pdf is a valid PDF/A-1b file

When I try to copy/paste the text from this PDF, it's all garbage, and the
CMap is missing.

Any advice would be greatly appreciated.
Thanks,
susan
-- 
Susan Borda
Digital Preservation Projects Manager
Digital Preservation Unit
University of Michigan Libraries
Buhr Building
sborda@umich.edu
*My office phone number is temporarily disconnected while I work remotely
due to COVID-19. Please contact me via email.*

Re: Does preflight check for "character encoding"?

Posted by Susan Borda <sb...@umich.edu>.
Thanks Tim!

I just ran veraPDF on 140K PDFs using the UA1 flag; it gave me a 9+ GB XML
file that I'm parsing now.
Here's the output of running a file where the CMap is missing entirely; I
ran this with the "a1" flag.

I'll try your tika-eval.jar on this file. Previously I ran the Python Tika
against this pile of files. I'll look for the "out of vocabulary" statistic
in that report.
-susan
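
For an XML report of that size, a streaming parse avoids loading the whole
file into memory. The following is a minimal sketch only, using Python's
standard library. The element name "rule", its "clause" and "status"
attributes, and the per-file wrapper element "job" are assumptions about the
report layout, not taken from this thread, so check them against the real
file before relying on the counts.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def count_failed_rules(source, rule_tag="rule", record_tag="job"):
    """Tally failed rules per clause while streaming through an XML report.

    ASSUMPTION: each check is an element named `rule_tag` carrying `clause`
    and `status` attributes, and each checked file is wrapped in an element
    named `record_tag`. Adjust both to match the real report layout.
    """
    failed = Counter()
    # iterparse hands back each element as soon as its end tag is read,
    # so the report can be processed record by record.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == rule_tag and elem.get("status") == "failed":
            failed[elem.get("clause", "unknown")] += 1
        elif elem.tag == record_tag:
            # Drop the finished record's children to limit memory growth.
            elem.clear()
    return failed


if __name__ == "__main__":
    # Tiny made-up report, only to show the call; not real veraPDF output.
    sample = b"""<report><job>
      <rule clause="6.3.4" status="failed"/>
      <rule clause="6.1.2" status="passed"/>
    </job></report>"""
    print(count_failed_rules(io.BytesIO(sample)))
```

The same function accepts a file path instead of a file object, so it could
be pointed at the report on disk once the tag names are confirmed.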

On Tue, Jun 27, 2023 at 12:25 PM Tim Allison <ta...@apache.org> wrote:

> Over on Apache Tika (via PDFBox!), we report the number of characters
> without Unicode mappings, and, if you add our tika-eval jar, you can also
> get an "out of vocabulary" statistic that is an indicator that extracted
> text is garbage. Happy to chat over on user@tika.apache.org on either of
> those topics.
>
> Would be interesting to see if veraPDF is also extracting unmapped Unicode
> chars...missing/broken fonts etc.



Re: Does preflight check for "character encoding"?

Posted by Tim Allison <ta...@apache.org>.
Over on Apache Tika (via PDFBox!), we report the number of characters
without Unicode mappings, and, if you add our tika-eval jar, you can also
get an "out of vocabulary" statistic that is an indicator that extracted
text is garbage. Happy to chat over on user@tika.apache.org on either of
those topics.
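
To illustrate the general idea behind an "out of vocabulary" statistic, here
is a simplified sketch. It is not tika-eval's implementation: the function
name, the tokenization rule, and the hand-built vocabulary are all
assumptions made for the example. It simply reports the share of alphabetic
tokens that do not appear in a known word list, which tends to be high when
extracted text is garbage.

```python
import re


def out_of_vocabulary_ratio(text, vocabulary):
    """Return the fraction of alphabetic tokens not found in `vocabulary`.

    `vocabulary` is any set of lowercase words; in practice it would be a
    list of common words for the document's language (an assumption here).
    """
    # [^\W\d_]+ matches runs of Unicode letters only.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return 0.0
    unknown = sum(1 for token in tokens if token not in vocabulary)
    return unknown / len(tokens)


if __name__ == "__main__":
    words = {"the", "file", "is", "a", "valid", "pdf"}
    print(out_of_vocabulary_ratio("The file is a valid PDF", words))
    print(out_of_vocabulary_ratio("Xqz vvb is kkrt", words))
```

A threshold for "garbage" would have to be tuned on real documents; short
texts and texts full of proper nouns will score high even when extraction
worked.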

Would be interesting to see if veraPDF is also extracting unmapped Unicode
chars...missing/broken fonts etc.
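
As a rough proxy for "characters without Unicode mappings", already-extracted
text can be scanned for characters that often indicate a failed mapping. This
is a sketch under a stated assumption: that unmapped glyphs surface as the
replacement character U+FFFD, as private-use or unassigned code points, or as
non-whitespace control characters. Whether they do depends on the extractor,
and this is not how Tika or PDFBox count them internally.

```python
import unicodedata

# ASSUMPTION: these general categories signal a missing Unicode mapping.
# Co = private use, Cn = unassigned, Cc = control characters.
SUSPECT_CATEGORIES = {"Co", "Cn", "Cc"}


def suspect_char_ratio(text):
    """Return the fraction of non-whitespace characters that look unmapped."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0
    suspect = sum(
        1
        for ch in visible
        if ch == "\ufffd" or unicodedata.category(ch) in SUSPECT_CATEGORIES
    )
    return suspect / len(visible)


if __name__ == "__main__":
    print(suspect_char_ratio("Plain readable text"))
    print(suspect_char_ratio("ab\ue000\ue001"))
```

Text that maps to valid but wrong letters will not be caught by this check,
which is where a vocabulary-based statistic is the more useful signal.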

On Tue, Jun 27, 2023 at 11:30 AM Susan Borda <sb...@umich.edu> wrote:

> Thanks Tilman, exactly the info I needed.

Re: Does preflight check for "character encoding"?

Posted by Susan Borda <sb...@umich.edu>.
Thanks Tilman, exactly the info I needed.

On Mon, Jun 26, 2023 at 10:21 PM Tilman Hausherr <TH...@t-online.de>
wrote:

> Hi,
> PDFBox preflight only checks for PDF/A-1b conformance, not for any
> accessibility topics. Maybe your PDF was deliberately made inaccessible to
> prevent scraping.
> Try https://verapdf.org/
> Tilman


Re: Does preflight check for "character encoding"?

Posted by Tilman Hausherr <TH...@t-online.de>.
Hi,
PDFBox preflight only checks for PDF/A-1b conformance, not for any
accessibility topics. Maybe your PDF was deliberately made inaccessible to
prevent scraping.
Try https://verapdf.org/
Tilman




---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org