Posted to users@pdfbox.apache.org by Karen Madore <Ka...@ascnet.com> on 2020/09/10 12:12:12 UTC

How to access the getValidCharCnt() method (or equivalent replacement) of PDFTextStripper

Hello,

We are migrating PDFBox from 1.8.16 to 2.0.21, and I am looking for the equivalent of the getValidCharCnt() method. Previously this method was accessible via the PDFTextStripper class, but in 2.0.21 the class no longer seems to have it, and I am unable to find a suitable replacement. Below is the line of code I am trying to migrate.

_hasText = (_hasText || stripper.getValidCharCnt() > 0) ? true : false;

Imports used are:
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.multipdf.Splitter;

I am new to PDFBox so this might be a newbie question.

Cheers,

Karen Madore


This email and any files transmitted with it are solely intended for the use of the addressee(s) and may contain information that is confidential and privileged. If you receive this email in error, please advise us by return email immediately. Please also disregard the contents of the email, delete it and destroy any copies immediately. Mediagrif Interactive Technologies Inc. and its subsidiaries do not accept liability for the views expressed in the email or for the consequences of any computer viruses that may be transmitted with this email. This email is also subject to copyright. No part of it should be reproduced, adapted or transmitted without the written consent of the copyright owner.

Re: How to access the getValidCharCnt() method (or equivalent replacement) of PDFTextStripper

Posted by Tilman Hausherr <TH...@t-online.de>.
Hi,

If you only need to know whether the count is > 0, then you don't need the method at all, because a count > 0 simply means that you got text.
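
As a rough sketch of that check in 2.0.x: the string returned by PDFTextStripper.getText(PDDocument) can be tested for any non-whitespace character. The class and method names below (TextPresence, hasText) are hypothetical and not part of PDFBox, and treating whitespace-only output as "no text" is an assumption. The helper is plain Java, so it does not depend on PDFBox itself.

```java
// Hypothetical helper, not part of PDFBox.
final class TextPresence {
    private TextPresence() {
    }

    /**
     * Returns true if the extracted text contains at least one
     * non-whitespace character.
     * Assumption: whitespace-only output counts as "no text".
     */
    static boolean hasText(String extracted) {
        if (extracted == null) {
            return false;
        }
        for (int i = 0; i < extracted.length(); i++) {
            if (!Character.isWhitespace(extracted.charAt(i))) {
                return true;
            }
        }
        return false;
    }
}
```

The migrated line could then look something like `_hasText = _hasText || TextPresence.hasText(stripper.getText(document));`, where `document` is assumed to be the PDDocument being processed.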

If you want a count, then extend the stripper and override showGlyph(). Here is how our "big sister project" Apache Tika does it:

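     // Counters kept as int fields of the PDFTextStripper subclass.
     int unmappedUnicodeCharsPerPage = 0;
     int totalCharsPerPage = 0;

     @Override
     protected void showGlyph(Matrix textRenderingMatrix, PDFont font, int code, String unicode, Vector displacement) throws IOException
     {
         super.showGlyph(textRenderingMatrix, font, code, unicode, displacement);
         // A glyph with no Unicode mapping counts as unmapped.
         if (unicode == null || unicode.isEmpty()) {
             unmappedUnicodeCharsPerPage++;
         }
         totalCharsPerPage++;
     }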
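
The counting in that override can also be isolated in a small plain-Java class, which makes it easy to derive a "valid" figure from the two counters. This is only a sketch: GlyphCounter and its methods are hypothetical names, and treating "total minus unmapped" as the valid character count is an assumption about what getValidCharCnt() reported in 1.8, not something confirmed in this thread.

```java
// Hypothetical helper, not part of PDFBox or Tika.
final class GlyphCounter {
    private int totalChars;
    private int unmappedChars;

    /** Call once per glyph with the Unicode string passed to showGlyph(). */
    void record(String unicode) {
        if (unicode == null || unicode.isEmpty()) {
            unmappedChars++;
        }
        totalChars++;
    }

    int getTotalChars() {
        return totalChars;
    }

    int getUnmappedChars() {
        return unmappedChars;
    }

    /** Assumption: a "valid" character is a glyph that has a Unicode mapping. */
    int getValidChars() {
        return totalChars - unmappedChars;
    }

    /** Reset the counters, for example at the start of each page. */
    void reset() {
        totalChars = 0;
        unmappedChars = 0;
    }
}
```

An overridden showGlyph() would call `counter.record(unicode)` right after `super.showGlyph(...)`, and the calling code could test `counter.getValidChars() > 0` once text extraction has finished.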

Tilman


