Posted to users@pdfbox.apache.org by Re...@flagstar.com on 2010/03/01 21:15:49 UTC

Re: PDFTextStripper.processTextPosition

Hi Villu Ruusmann,

Do you think disabling "character spacing" will be made a little easier
in later versions of PDFBox, for example by setting a property or passing
a value to a method? Since the method you suggested changing does a lot
of things, I am hesitant to override it.

Please let me know. Thank you.

Regards,
Rekha




From: Villu Ruusmann <vi...@gmail.com>
To: Rekha.Hariramakrishnan@flagstar.com
Cc: users@pdfbox.apache.org
Date: 02/19/2010 01:18 PM
Subject: Re: PDFTextStripper.processTextPosition



Hello there,

>
> And about your example, you are saying that "Hello World" would result
> in two invocations.
> But 1.0 results in 10 or 11 invocations - once for each character.
>

Your PDF document contains a "character spacing" instruction, which
states that all characters should be painted slightly apart from each
other. For example:
"H"(0.01)"e"(0.01)"l"(0.01)"l"(0.01)"o"(10.0)"W"(0.01)"o"(0.01)"r"(0.01)"l"(0.01)"d".
PDFBox 0.8.0 did not honour this instruction, but PDFBox 1.0.X does. I
must admit that this is annoying when dealing with small "character
spacing" values (< 0.1).

> Anyway, it is not that I must be able to use the processTextPosition
> method to do my job.
> What I am trying to say is this: my goal is to be able to say what the
> "quality of Construction" was for "comparable sale #1" in the image I
> sent you before. If that is clear, then maybe you could tell me if
> there is a way to do that with PDFBox.
>

I looked it up from the image - the bounding box of that cell is
[x=610, y=520, width=180, height=30].

You can use class PDFTextStripperByArea instead of PDFTextStripper:

import java.awt.geom.Rectangle2D;
import org.apache.pdfbox.util.PDFTextStripperByArea;

PDFTextStripperByArea textStripper = new PDFTextStripperByArea();
// Define the symbolic name and the bounding box of the field
textStripper.addRegion("CS1-QoC", new Rectangle2D.Float(610, 520, 180, 30));
// ... add more fields as needed
textStripper.extractRegions(pdfPage); // pdfPage is the PDPage holding the form
// Retrieve the value of the field by its symbolic name
String qualityOfConstrForCompSale1 = textStripper.getTextForRegion("CS1-QoC");
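
To make the idea of region-based extraction concrete, here is a minimal,
self-contained sketch that needs no PDFBox at all. It is only an
illustration of the principle (keep the text whose position falls inside
a named rectangle), not PDFBox's actual implementation. The class
RegionFilterSketch, the Glyph type and the sample coordinates are all
hypothetical names and values invented for this example.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

// Toy model only: NOT PDFBox code. All names here are hypothetical.
class RegionFilterSketch {

    // A piece of text together with the position where it is painted.
    static final class Glyph {
        final float x;
        final float y;
        final String text;

        Glyph(float x, float y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    // Concatenates, in input order, the text of every glyph whose
    // position lies inside the given region.
    static String textInRegion(Rectangle2D region, List<Glyph> glyphs) {
        StringBuilder sb = new StringBuilder();
        for (Glyph g : glyphs) {
            if (region.contains(g.x, g.y)) {
                sb.append(g.text);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = new ArrayList<Glyph>();
        glyphs.add(new Glyph(620f, 530f, "Good"));   // assumed to lie inside the cell
        glyphs.add(new Glyph(100f, 100f, "Other"));  // assumed to lie elsewhere on the page
        Rectangle2D cell = new Rectangle2D.Float(610, 520, 180, 30);
        System.out.println(textInRegion(cell, glyphs));
    }
}
```

The point of the design is that the caller only deals with a symbolic
name and a rectangle, so small changes in character spacing inside the
cell should not matter for this kind of lookup.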

>
> I was able to do that with version 0.8. Is there a way to set a
> particular value for Tc, Tw, Tj etc. so that it would behave the way it
> did before? Just like I have the option to set "setWordSeparator",
> "setLineSeparator" and "setPageSeparator" to "", effectively ignoring
> word separation, line separation and page separation respectively for
> PDFTextStripper.writeText.
>

You could modify class org.apache.pdfbox.util.PDFStreamEngine to suit
your needs. If I'm not mistaken, then the logic which controls the
processing of characters is located on lines 481-484 (as of SVN
revision 908338). If you want to disable "character spacing", delete
the equality expression "spacingText == 0". If you want to make it
less sensitive, substitute "0" with something greater such as "0.1".
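
The effect of such a threshold can be shown with a small toy model. This
is a sketch under my own assumptions and not the code in
PDFStreamEngine: the class SpacingThresholdSketch, the method groupRuns
and the rule "a gap larger than the threshold ends the current run" are
invented here purely for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: NOT the PDFStreamEngine logic. All names are hypothetical.
class SpacingThresholdSketch {

    // Groups glyphs into runs. gaps[i] is the spacing that follows
    // glyphs[i]; a gap whose absolute value exceeds the threshold ends
    // the current run. The last glyph always closes its run.
    static List<String> groupRuns(String[] glyphs, double[] gaps, double threshold) {
        List<String> runs = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < glyphs.length; i++) {
            current.append(glyphs[i]);
            boolean last = (i == glyphs.length - 1);
            if (last || Math.abs(gaps[i]) > threshold) {
                runs.add(current.toString());
                current.setLength(0);
            }
        }
        return runs;
    }

    public static void main(String[] args) {
        String[] glyphs = {"H", "e", "l", "l", "o", "W", "o", "r", "l", "d"};
        double[] gaps = {0.01, 0.01, 0.01, 0.01, 10.0, 0.01, 0.01, 0.01, 0.01};
        // Threshold 0: every non-zero gap splits, one run per character.
        System.out.println(groupRuns(glyphs, gaps, 0.0).size());
        // Threshold 0.1: only the large gap splits, giving "Hello" and "World".
        System.out.println(groupRuns(glyphs, gaps, 0.1));
    }
}
```

In this model a threshold of 0 reproduces the one-invocation-per-character
behaviour, while 0.1 leaves only the split at the large gap between the
two words.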


VR



