Posted to users@pdfbox.apache.org by Hengyu Weng <ap...@gmail.com> on 2024/03/05 10:52:24 UTC

Feature request for filtering TextPosition in PDFTextStripperByArea and PDFTextStripper

Sometimes a watermark overlaps the normal text that we want to extract, so
it would be great to be able to insert a filter that skips unwanted
TextPositions (e.g. the watermark text may be rotated). I think the
'writePage' method in 'PDFTextStripper' is an appropriate place to add this
filter, but it is difficult to override because it refers to many private
members. In addition, PDFTextStripper extends LegacyPDFStreamEngine, which
is a non-public class, so I cannot simply copy and modify it.

Currently I'm embedding the PDFBox source code so that I can modify the
above classes. I believe it would be much better if you could officially
add an insertion point or some hooks to them.

Thank you.

Re: Feature request for filtering TextPosition in PDFTextStripperByArea and PDFTextStripper

Posted by Tilman Hausherr <TH...@t-online.de>.
I think I did something similar in 2018 that you might be able to use: see
the FilteredTextStripper class in ExtractText.java. That one only extracts
text with angle 0.


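import java.io.IOException;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;

/**
 * TextStripper that only processes glyphs that have angle 0.
 */
class FilteredTextStripper extends PDFTextStripper
{
     FilteredTextStripper() throws IOException
     {
     }

     @Override
     protected void processTextPosition(TextPosition text)
     {
         // Drop rotated glyphs (e.g. watermark text) before they are collected;
         // only upright glyphs are passed on to the normal extraction.
         int angle = ExtractText.getAngle(text);
         if (angle == 0)
         {
             super.processTextPosition(text);
         }
     }
}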



     static int getAngle(TextPosition text)
     {
         // should this become a part of TextPosition?
         Matrix m = text.getTextMatrix().clone();
         m.concatenate(text.getFont().getFontMatrix());
         return (int) Math.round(
                 Math.toDegrees(Math.atan2(m.getShearY(), m.getScaleY())));
     }
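
To illustrate what the angle computation above does, here is a minimal
sketch using only the Java standard library, with no PDFBox involved. It
assumes that for a rotation by theta the matrix values returned by
getShearY() and getScaleY() are sin(theta) and cos(theta) respectively. The
class name GlyphAngleSketch and the helpers angleFromMatrix and keep are
hypothetical names made up for this example; they are not part of PDFBox.

```java
public class GlyphAngleSketch
{
    /**
     * Angle in whole degrees for a glyph matrix, given the two values that
     * getAngle() reads: shearY (assumed sin(theta) for a pure rotation) and
     * scaleY (assumed cos(theta), possibly multiplied by the font size).
     */
    static int angleFromMatrix(double shearY, double scaleY)
    {
        return (int) Math.round(Math.toDegrees(Math.atan2(shearY, scaleY)));
    }

    /** Hypothetical filter rule: keep glyphs within a tolerance of upright. */
    static boolean keep(int angleDegrees, int toleranceDegrees)
    {
        return Math.abs(angleDegrees) <= toleranceDegrees;
    }

    public static void main(String[] args)
    {
        // A watermark rotated by 45 degrees: shearY = sin, scaleY = cos.
        double theta = Math.toRadians(45);
        int watermark = angleFromMatrix(Math.sin(theta), Math.cos(theta));

        // Upright body text at 12pt: no shear, positive vertical scale.
        int body = angleFromMatrix(0.0, 12.0);

        System.out.println("watermark angle " + watermark
                + ", kept: " + keep(watermark, 0));
        System.out.println("body angle " + body
                + ", kept: " + keep(body, 0));
    }
}
```

With a tolerance of 0 this matches the "angle == 0" check in
FilteredTextStripper; a small non-zero tolerance might be useful for
slightly skewed scans, but that is a design choice and not something the
original class does.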


Tilman



