Posted to users@pdfbox.apache.org by Hengyu Weng <ap...@gmail.com> on 2024/03/05 10:52:24 UTC
Feature request for filtering TextPosition in PDFTextStripperByArea and PDFTextStripper
Sometimes a watermark overlaps the normal text that we want to extract,
so it would be great if it were possible to insert a filter and skip
unwanted TextPositions (e.g. watermark text, which may be rotated).
I think the 'writePage' method in 'PDFTextStripper' is an appropriate
place to add this filter, but I found it difficult to override because
it refers to many private members. PDFTextStripper also extends
LegacyPDFStreamEngine, which is a non-public class, so I cannot copy
and modify it.
Currently I'm embedding the PDFBox source code so that I can modify the
above classes. I believe it would be much better if you could officially
add an insertion point or some hooks to them.
Thank you.
Re: Feature request for filtering TextPosition in PDFTextStripperByArea and PDFTextStripper
Posted by Tilman Hausherr <TH...@t-online.de>.
I think I did something similar in 2018 that you might use; see the
FilteredTextStripper class in ExtractText.java. That one only extracts
text with angle 0.
/**
 * TextStripper that only processes glyphs that have angle 0.
 */
class FilteredTextStripper extends PDFTextStripper
{
    FilteredTextStripper() throws IOException
    {
    }

    @Override
    protected void processTextPosition(TextPosition text)
    {
        int angle = ExtractText.getAngle(text);
        if (angle == 0)
        {
            super.processTextPosition(text);
        }
    }
}

static int getAngle(TextPosition text)
{
    // should this become a part of TextPosition?
    Matrix m = text.getTextMatrix().clone();
    m.concatenate(text.getFont().getFontMatrix());
    return (int) Math.round(Math.toDegrees(Math.atan2(m.getShearY(), m.getScaleY())));
}
Tilman
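To make the angle test in getAngle() easier to follow, here is a minimal standalone sketch of just the arithmetic. It does not use any PDFBox classes. It assumes the usual PDF matrix layout [a b c d e f], where b corresponds to what getShearY() returns and d to what getScaleY() returns; that mapping and the names GlyphAngleSketch, angle and angleOfRotation are illustrative assumptions, not part of PDFBox.

```java
// Standalone sketch: plain Java, no PDFBox dependency.
final class GlyphAngleSketch
{
    /**
     * Same arithmetic as the last line of getAngle(): the rotation in whole
     * degrees, derived from the matrix entries b (shearY) and d (scaleY).
     */
    static int angle(double shearY, double scaleY)
    {
        return (int) Math.round(Math.toDegrees(Math.atan2(shearY, scaleY)));
    }

    /**
     * Hypothetical helper: for a pure rotation by the given number of degrees
     * the matrix has b = sin(theta) and d = cos(theta).
     */
    static int angleOfRotation(double degrees)
    {
        double theta = Math.toRadians(degrees);
        return angle(Math.sin(theta), Math.cos(theta));
    }

    public static void main(String[] args)
    {
        System.out.println(angleOfRotation(0));   // upright text, kept by the filter
        System.out.println(angleOfRotation(45));  // diagonal watermark, skipped
        System.out.println(angle(0.0, 12.0));     // upright text scaled by a font size
    }
}
```

Because atan2 only looks at the ratio and signs of its arguments, a uniform positive scale (such as the font size) does not change the result. The result lies between -180 and 180, so a glyph rotated by 270 degrees is reported as -90. A filter that should keep something other than upright text only needs a different comparison than angle == 0 in processTextPosition.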