Posted to users@pdfbox.apache.org by Re...@flagstar.com on 2010/02/18 18:42:51 UTC
PDFTextStripper.processTextPosition
Hello,
I was using PDFBox 0.8, where
PDFTextStripper.processTextPosition(TextPosition text) was called once for
every "field". With 1.0 it looks like it is called once for every
character. Could you please tell me how to get it called only once per
"field"? Thank you.
Example:
With 0.8 version
text.getYDirAdj(), text.getXDirAdj(), text.getCharacter()
85.44,28.32,There are
85.44,85.92,comparable sales in the subject neighborhood within the past
twelve months ranging in sale price from $
With 1.0 version
text.getYDirAdj(), text.getXDirAdj(), text.getCharacter()
511.68,157.7402,r
511.68,159.88484,a
511.68,163.43492,n
Regards,
Rekha
This e-mail may contain data that is confidential, proprietary or
non-public personal information, as that term is defined in the
Gramm-Leach-Bliley Act (collectively, Confidential Information).
The Confidential Information is disclosed conditioned upon your
agreement that you will treat it confidentially and in accordance
with applicable law, ensure that such data isn't used or disclosed
except for the limited purpose for which it's being provided and
will notify and cooperate with us regarding any requested or
unauthorized disclosure or use of any Confidential Information.
By accepting and reviewing the Confidential information, you agree
to indemnify us against any losses or expenses, including
attorney's fees that we may incur as a result of any unauthorized
use or disclosure of this data due to your acts or omissions. If a
party other than the intended recipient receives this e-mail, he or
she is requested to instantly notify us of the erroneous delivery
and return to us all data so delivered.
Re: PDFTextStripper.processTextPosition
Posted by Re...@flagstar.com.
Hi Villu Ruusmann,
Do you think disabling "character spacing" will be made a little easier,
for example by setting a property or passing a value to a method, in later
versions of PDFBox? Since the method you suggested changing does a
lot of things, I am hesitant to override it.
Please let me know. Thank you.
Regards,
Rekha
From:
Villu Ruusmann <vi...@gmail.com>
To:
Rekha.Hariramakrishnan@flagstar.com
Cc:
users@pdfbox.apache.org
Date:
02/19/2010 01:18 PM
Subject:
Re: PDFTextStripper.processTextPosition
Hello there,
>
> And about your example, you are saying that "Hello World" would result in two invocations.
> But 1.0 results in 10 or 11 invocations - once for each character.
>
Your PDF document contains a "character spacing" instruction, which
states that all characters should be painted away from each other.
Like this -
"H"(0.01)"e"(0.01)"l"(0.01)"l"(0.01)"o"(10.0)"W"(0.01)"o"(0.01)"r"(0.01)"l"(0.01)"d".
PDFBox 0.8.0 did not honour this instruction, but PDFBox 1.0.X does. I
must admit that this is annoying when dealing with small "character
spacing" values (< 0.1).
> Anyway, it is not that I should be able use processTextPosition method to do my job.
> What I am trying to say is - if you understood my goal is - I should be able to say what the
> "quality of Construction" was for "comparable sale #1" in the image I sent you before,
> then may be you could tell me if there is a way to do that with PDFBox.
>
I looked it up from the image - the bounding box of that cell is
[x=610, y=520, width=180, height=30].
You can use class PDFTextStripperByArea instead of PDFTextStripper:
// Needs java.awt.geom.Rectangle2D and org.apache.pdfbox.util.PDFTextStripperByArea (PDFBox 1.x package)
PDFTextStripperByArea textStripper = new PDFTextStripperByArea();
// Define the symbolic name and the bounding box of the field
textStripper.addRegion("CS1-QoC", new Rectangle2D.Float(610, 520, 180, 30));
// ... add more fields as needed
textStripper.extractRegions(pdfPage);
// Retrieve the value of the field by the symbolic name
String qualityOfConstrForCompSale1 = textStripper.getTextForRegion("CS1-QoC");
>
> I was able to do that with version 0.8. Is there a way to set a particular value to Tc, Tw, Tj etc
> so that It would behave the way it did before. Just like I have the option to set the
> "setWordSeparator", "setLineSeparator" and "setPageSeparator" to "" - effectively ignoring word
> separation, lineseparation and pageseparation respectively for PDFTextStripper.writeText.
>
You could modify class org.apache.pdfbox.util.PDFStreamEngine to suit
your needs. If I'm not mistaken, then the logic which controls the
processing of characters is located on lines 481-484 (as of SVN
revision 908338). If you want to disable "character spacing", delete
the equality expression "spacingText == 0". If you want to make it
less sensitive, substitute "0" with something greater such as "0.1".
VR
Re: PDFTextStripper.processTextPosition
Posted by Re...@flagstar.com.
You are right, I am trying to parse that form. The reason I am trying to
use processTextPosition is that we will be doing this programmatically;
there will be no one selecting the region. Also, we will be extracting the
data from forms generated by different providers, which do not look exactly
the same. For example, the whole page looks kind of squished. I tried
PDFTextStripperByArea#extractRegions(PDPage), but since the positions are
not exactly the same, I either lose data or pick up data from the next
column.
Is there a way to find the column coordinates for
PDFTextStripperByArea#extractRegions(PDPage) programmatically, so that the
regions are more accurate?
From:
Villu Ruusmann <vi...@gmail.com>
To:
Rekha.Hariramakrishnan@flagstar.com
Cc:
users@pdfbox.apache.org
Date:
02/26/2010 02:47 AM
Subject:
Re: PDFTextStripper.processTextPosition
Hello there,
>
> I thought of continuing to use 0.8 version for my purpose for now.
> Hoping I will have the easier way to achieve it in the later versions of PDFBox.
>
> The reason for this email is, I am having a difference in the data I receive if I run
> PDFTextStripper.writeText() and if I extend PDFTextStripper.processTextPosition( ).
> For example, I have attached a one-page pdf I used for this.
It is unclear to me why you insist on using
PDFTextStripper#processTextPosition(TextPosition) to do the job when
there are better alternatives available.
The example document you sent to me is the second page of the Freddie
Mac Form 70 (http://www.freddiemac.com/sell/forms/pdf/70.pdf), which
has a fixed 3-column layout.
In order to extract field values, you need to find out their bounding
boxes. As long as there is no PDFBox GUI around, I suggest you
use Foxit PDF Editor for that (select an element and open "Property
List" from its context menu). Then, instantiate a
PDFTextStripperByArea and populate it by invoking
PDFTextStripperByArea#addRegion(String, Rectangle2D) for every field.
Then, process the page by invoking
PDFTextStripperByArea#extractRegions(PDPage). Finally, retrieve field
values by invoking PDFTextStripperByArea#getTextForRegion(String) for
every field. Note that you do not need to override any methods in
class PDFTextStripperByArea - the public API does just fine.
I have attached a sample application (FreddieMacForm70.java) that
extracts the fields "Sale Price", "Date of Sale/Time", and "Gross
Living Area" for all 3 comparable sales. You can add other fields as
needed.
VR
[attachment "FreddieMacForm70.java" deleted by Rekha
Hariramakrishnan/Flagstar_notes]
Re: PDFTextStripper.processTextPosition
Posted by Villu Ruusmann <vi...@gmail.com>.
Hello there,
>
> I thought of continuing to use 0.8 version for my purpose for now.
> Hoping I will have the easier way to achieve it in the later versions of PDFBox.
>
> The reason for this email is, I am having a difference in the data I receive if I run
> PDFTextStripper.writeText() and if I extend PDFTextStripper.processTextPosition( ).
> For example, I have attached a one-page pdf I used for this.
It is unclear to me why you insist on using
PDFTextStripper#processTextPosition(TextPosition) to do the job when
there are better alternatives available.
The example document you sent to me is the second page of the Freddie
Mac Form 70 (http://www.freddiemac.com/sell/forms/pdf/70.pdf), which
has a fixed 3-column layout.
In order to extract field values, you need to find out their bounding
boxes. As long as there is no PDFBox GUI around, I suggest you
use Foxit PDF Editor for that (select an element and open "Property
List" from its context menu). Then, instantiate a
PDFTextStripperByArea and populate it by invoking
PDFTextStripperByArea#addRegion(String, Rectangle2D) for every field.
Then, process the page by invoking
PDFTextStripperByArea#extractRegions(PDPage). Finally, retrieve field
values by invoking PDFTextStripperByArea#getTextForRegion(String) for
every field. Note that you do not need to override any methods in
class PDFTextStripperByArea - the public API does just fine.
I have attached a sample application (FreddieMacForm70.java) that
extracts the fields "Sale Price", "Date of Sale/Time", and "Gross
Living Area" for all 3 comparable sales. You can add other fields as
needed.
VR
Re: PDFTextStripper.processTextPosition
Posted by Villu Ruusmann <vi...@gmail.com>.
Hello there,
>
> And about your example, you are saying that "Hello World" would result in two invocations.
> But 1.0 results in 10 or 11 invocations - once for each character.
>
Your PDF document contains a "character spacing" instruction, which
states that all characters should be painted away from each other.
Like this - "H"(0.01)"e"(0.01)"l"(0.01)"l"(0.01)"o"(10.0)"W"(0.01)"o"(0.01)"r"(0.01)"l"(0.01)"d".
PDFBox 0.8.0 did not honour this instruction, but PDFBox 1.0.X does. I
must admit that this is annoying when dealing with small "character
spacing" values (< 0.1).
> Anyway, it is not that I should be able use processTextPosition method to do my job.
> What I am trying to say is - if you understood my goal is - I should be able to say what the
>"quality of Construction" was for "comparable sale #1" in the image I sent you before,
> then may be you could tell me if there is a way to do that with PDFBox.
>
I looked it up from the image - the bounding box of that cell is
[x=610, y=520, width=180, height=30].
You can use class PDFTextStripperByArea instead of PDFTextStripper:
// Needs java.awt.geom.Rectangle2D and org.apache.pdfbox.util.PDFTextStripperByArea (PDFBox 1.x package)
PDFTextStripperByArea textStripper = new PDFTextStripperByArea();
// Define the symbolic name and the bounding box of the field
textStripper.addRegion("CS1-QoC", new Rectangle2D.Float(610, 520, 180, 30));
// ... add more fields as needed
textStripper.extractRegions(pdfPage);
// Retrieve the value of the field by the symbolic name
String qualityOfConstrForCompSale1 = textStripper.getTextForRegion("CS1-QoC");
>
> I was able to do that with version 0.8. Is there a way to set a particular value to Tc, Tw, Tj etc
> so that It would behave the way it did before. Just like I have the option to set the
> "setWordSeparator", "setLineSeparator" and "setPageSeparator" to "" - effectively ignoring word
> separation, lineseparation and pageseparation respectively for PDFTextStripper.writeText.
>
You could modify class org.apache.pdfbox.util.PDFStreamEngine to suit
your needs. If I'm not mistaken, then the logic which controls the
processing of characters is located on lines 481-484 (as of SVN
revision 908338). If you want to disable "character spacing", delete
the equality expression "spacingText == 0". If you want to make it
less sensitive, substitute "0" with something greater such as "0.1".
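To illustrate what such a tolerance changes, here is a minimal, self-contained sketch. It is not the PDFStreamEngine code and does not reproduce the lines mentioned above; the class SpacingToleranceSketch, the method group and its parameters are hypothetical names used only for this illustration. It assumes each character comes with the spacing that follows it, and starts a new run whenever that spacing exceeds the tolerance.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of PDFBox.
final class SpacingToleranceSketch {

    /**
     * Splits chars into runs. A run ends whenever the spacing that follows
     * a character is greater than tolerance.
     * spacingAfter[i] is the spacing between chars[i] and chars[i + 1].
     */
    static List<String> group(char[] chars, float[] spacingAfter, float tolerance) {
        List<String> runs = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        for (int i = 0; i < chars.length; i++) {
            run.append(chars[i]);
            boolean last = (i == chars.length - 1);
            if (last || spacingAfter[i] > tolerance) {
                runs.add(run.toString());
                run.setLength(0);
            }
        }
        return runs;
    }

    public static void main(String[] args) {
        char[] chars = "HelloWorld".toCharArray();
        // Spacing values taken from the "Hello World" example in this message.
        float[] spacing = {0.01f, 0.01f, 0.01f, 0.01f, 10.0f, 0.01f, 0.01f, 0.01f, 0.01f};
        System.out.println(group(chars, spacing, 0f));   // one run per character
        System.out.println(group(chars, spacing, 0.1f)); // [Hello, World]
    }
}
```

With a tolerance of 0 every character becomes its own run, which matches the per-character behaviour reported for 1.0; with 0.1 the small 0.01 gaps are ignored and only the 10.0 gap splits the text.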
VR
Re: PDFTextStripper.processTextPosition
Posted by Re...@flagstar.com.
Hello VR,
I agree with you that if we had control over the way we store/exchange
data, then it should be XML. But we are forced to accept PDF in our case.
About your example: you are saying that "Hello World" would result in
two invocations. But 1.0 results in 10 or 11 invocations - once for each
character.
Anyway, it is not that I have to use the processTextPosition method to
do my job. What I am trying to say is that my goal is to be able to say
what the "quality of Construction" was for "comparable sale #1" in the
image I sent you before. If you understand that goal, then maybe you
could tell me if there is a way to do that with PDFBox.
I was able to do that with version 0.8. Is there a way to set a particular
value for Tc, Tw, Tj etc. so that it behaves the way it did before?
Just like I have the option to set "setWordSeparator",
"setLineSeparator" and "setPageSeparator" to "" - effectively ignoring
word separation, line separation and page separation respectively for
PDFTextStripper.writeText. I appreciate your help.
Rekha
From:
Villu Ruusmann <vi...@gmail.com>
To:
Rekha.Hariramakrishnan@flagstar.com
Cc:
users@pdfbox.apache.org
Date:
02/19/2010 11:21 AM
Subject:
Re: PDFTextStripper.processTextPosition
Hello there,
>
> I read the link you have send me. It is above my understanding of the PDFs and PDFBoxTextStripper.
> I am trying to parse this content from the PDF. With 0.8, the PDFTextStripper.processTextPosition()
> was called for every column value(e.g: "Mt. Pleasant, SC 29466-8583").
>
First of all, your assumption that every "field" should result in
exactly one invocation of
PDFTextStripper#processTextPosition(TextPosition) is too naive when it
comes to real-world PDF documents.
Maybe it helps if you consider that there is no such thing as a "white
space" literal in PDF. Imagine a PDF document that prints "Hello
World". When this document is rendered by a conforming PDF software
(for example, Acrobat Reader) then what happens is that the software
first draws the string "Hello", leaves some horizontal space, and then
draws the string "World". When this document is processed with
PDFBox's utilities such as PDFTextStripper, there would be two
invocations of PDFTextStripper#processTextPosition(TextPosition) - the
first for the string "Hello" and the second for the string "World". It
is the responsibility of the application that consumes those
TextPositions to figure out (by comparing their relative positions on
screen) that they should be combined to yield "Hello World".
> So I thought I will use the getYDirAdj and getXDirAdj methods to sort them and take the values.
> Now I do not know where each of those column value end. For eg. How will I know "Mt. Pleasant,
> SC 29466-8583" is from one "field" if I get one character at a time and setSortByPosition(true) also
> doesn't work with the processTextPosition(). Could you please tell me if there is a better way of do that.
>
The sample you sent to me revealed a rather complex table structure.
Assuming this is a fixed layout you can obtain "fields" if you define
the bounding box of each cell (x, y, width, height), collect all the
TextPositions that fall into that region, and finally join the
collected TextPositions into the result string. You are correct that
you must use TextPosition#getXDirAdj, #getYDirAdj, #getWidthDirAdj to
do the job.
PDF really isn't a good choice for data storage or exchange. You would
be better off if you could obtain this data in some structured format
such as XML.
VR
Re: PDFTextStripper.processTextPosition
Posted by Villu Ruusmann <vi...@gmail.com>.
Hello there,
>
> I read the link you have send me. It is above my understanding of the PDFs and PDFBoxTextStripper.
> I am trying to parse this content from the PDF. With 0.8, the PDFTextStripper.processTextPosition()
> was called for every column value(e.g: "Mt. Pleasant, SC 29466-8583").
>
First of all, your assumption that every "field" should result in
exactly one invocation of
PDFTextStripper#processTextPosition(TextPosition) is too naive when it
comes to real-world PDF documents.
Maybe it helps if you consider that there is no such thing as a "white
space" literal in PDF. Imagine a PDF document that prints "Hello
World". When this document is rendered by a conforming PDF software
(for example, Acrobat Reader) then what happens is that the software
first draws the string "Hello", leaves some horizontal space, and then
draws the string "World". When this document is processed with
PDFBox's utilities such as PDFTextStripper, there would be two
invocations of PDFTextStripper#processTextPosition(TextPosition) - the
first for the string "Hello" and the second for the string "World". It
is the responsibility of the application that consumes those
TextPositions to figure out (by comparing their relative positions on
screen) that they should be combined to yield "Hello World".
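The combining step described above can be sketched roughly as follows. This is only an illustration under stated assumptions: Fragment is a hypothetical stand-in for a positioned piece of text (not a PDFBox class), its x and width fields play the role of TextPosition#getXDirAdj and #getWidthDirAdj, all fragments are assumed to lie on the same line, and the spaceThreshold value is arbitrary.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration only; not part of PDFBox.
final class FragmentMergeSketch {

    /** Stand-in for a positioned text fragment on a single line. */
    static final class Fragment {
        final float x;      // left edge, comparable to getXDirAdj
        final float width;  // horizontal extent, comparable to getWidthDirAdj
        final String text;

        Fragment(float x, float width, String text) {
            this.x = x;
            this.width = width;
            this.text = text;
        }
    }

    /**
     * Joins fragments from left to right, inserting a space whenever the
     * horizontal gap between two neighbours exceeds spaceThreshold.
     */
    static String merge(List<Fragment> fragments, float spaceThreshold) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.x));
        StringBuilder out = new StringBuilder();
        Fragment previous = null;
        for (Fragment current : sorted) {
            if (previous != null) {
                float gap = current.x - (previous.x + previous.width);
                if (gap > spaceThreshold) {
                    out.append(' ');
                }
            }
            out.append(current.text);
            previous = current;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Fragment> line = Arrays.asList(
                new Fragment(130f, 28f, "World"),
                new Fragment(100f, 25f, "Hello"));
        System.out.println(merge(line, 1.0f)); // prints "Hello World"
    }
}
```

The same loop also handles per-character fragments: as long as the gaps between neighbouring characters stay below the threshold, they are concatenated without a space.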
> So I thought I will use the getYDirAdj and getXDirAdj methods to sort them and take the values.
> Now I do not know where each of those column value end. For eg. How will I know "Mt. Pleasant,
> SC 29466-8583" is from one "field" if I get one character at a time and setSortByPosition(true) also
> doesn't work with the processTextPosition(). Could you please tell me if there is a better way of do that.
>
The sample you sent to me revealed a rather complex table structure.
Assuming this is a fixed layout you can obtain "fields" if you define
the bounding box of each cell (x, y, width, height), collect all the
TextPositions that fall into that region, and finally join the
collected TextPositions into the result string. You are correct that
you must use TextPosition#getXDirAdj, #getYDirAdj, #getWidthDirAdj to
do the job.
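A minimal sketch of the bounding-box approach described above, assuming a fixed layout. Glyph is a hypothetical stand-in for TextPosition (its x and y fields correspond to getXDirAdj and getYDirAdj), CellTextSketch and textInCell are illustrative names, and the cell rectangle reuses the example values from elsewhere in this thread. The glyph coordinates must be in the same coordinate system as the rectangle.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration only; not part of PDFBox.
final class CellTextSketch {

    /** Stand-in for TextPosition: position of the glyph plus its text. */
    static final class Glyph {
        final float x;
        final float y;
        final String text;

        Glyph(float x, float y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /**
     * Collects the glyphs whose position falls inside the cell, sorts them
     * by y and then by x, and joins their text into one string.
     */
    static String textInCell(List<Glyph> glyphs, Rectangle2D cell) {
        List<Glyph> hits = new ArrayList<>();
        for (Glyph g : glyphs) {
            if (cell.contains(g.x, g.y)) {
                hits.add(g);
            }
        }
        hits.sort(Comparator.comparingDouble((Glyph g) -> g.y)
                .thenComparingDouble(g -> g.x));
        StringBuilder sb = new StringBuilder();
        for (Glyph g : hits) {
            sb.append(g.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Rectangle2D cell = new Rectangle2D.Float(610, 520, 180, 30);
        List<Glyph> glyphs = Arrays.asList(
                new Glyph(624f, 530f, "g"),
                new Glyph(612f, 530f, "A"),
                new Glyph(800f, 530f, "X"), // outside the cell, ignored
                new Glyph(618f, 530f, "v"));
        System.out.println(textInCell(glyphs, cell)); // prints "Avg"
    }
}
```

Sorting by exact y assumes that glyphs on one line share the same y value; real documents may need a small tolerance when comparing lines.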
PDF really isn't a good choice for data storage or exchange. You would
be better off if you could obtain this data in some structured format
such as XML.
VR
Re: PDFTextStripper.processTextPosition
Posted by Re...@flagstar.com.
Hello VR,
I read the link you sent me. It is beyond my understanding of PDFs
and the PDFBox text stripper. I am trying to parse this content from the PDF.
With 0.8, PDFTextStripper.processTextPosition() was called once for every
column value (e.g. "Mt. Pleasant, SC 29466-8583"). So I thought I would use
the getYDirAdj and getXDirAdj methods to sort them and take the values.
Now I do not know where each of those column values ends. For example, how
will I know that "Mt. Pleasant, SC 29466-8583" is from one "field" if I get
one character at a time? Also, setSortByPosition(true) does not work with
processTextPosition(). Could you please tell me if there is a better way
to do that? Thank you.
Regards,
Rekha
From:
Villu Ruusmann <vi...@gmail.com>
To:
users@pdfbox.apache.org
Date:
02/19/2010 05:42 AM
Subject:
Re: PDFTextStripper.processTextPosition
Hello there,
>
> I was using pdfbox 0.8 version and
> PDFTextStripper.processTextPosition(TextPosition text) was called for
> every "field"???. With 1.0 it looks like it is calling it for every
> character. Could you please tell me how to get it to call only on every
> "field". Thank you.
>
In short, your PDF document contains a "character spacing"
instruction, which PDFTextStripper now correctly abides by.
The change is detailed here:
https://issues.apache.org/jira/browse/PDFBOX-520
Since this change did not have a negative impact on the correctness of
the output of PDFTextStripper (quite the contrary!), could you please
elaborate on what the downside of this solution is for you? A noticeable
performance degradation?
VR
Re: PDFTextStripper.processTextPosition
Posted by Villu Ruusmann <vi...@gmail.com>.
Hello there,
>
> I was using pdfbox 0.8 version and
> PDFTextStripper.processTextPosition(TextPosition text) was called for
> every "field"???. With 1.0 it looks like it is calling it for every
> character. Could you please tell me how to get it to call only on every
> "field". Thank you.
>
In short, your PDF document contains a "character spacing"
instruction, which PDFTextStripper now correctly abides by.
The change is detailed here:
https://issues.apache.org/jira/browse/PDFBOX-520
Since this change did not have a negative impact on the correctness of
the output of PDFTextStripper (quite the contrary!), could you please
elaborate on what the downside of this solution is for you? A noticeable
performance degradation?
VR