Posted to users@pdfbox.apache.org by Yogesh <yo...@gmail.com> on 2010/12/05 20:37:42 UTC
Extracting Text from 2 Column PDFs
Hello,
I am extracting text from 2-column PDF documents. For some documents, the
text is extracted properly (column-wise), while for others it is extracted
line-wise, i.e. a line of the 1st column is merged with the adjacent line of
the 2nd column instead of being followed by the next line of the 1st column.
What might be the reason? How can I correct it?
Thanks,
-Yogesh
Re: Extracting Text from 2 Column PDFs
Posted by Ad...@swmc.com.
There's a nice patch that improves how PDFBox detects which words go
together; it will be released later this month. I believe it's in the
1.4.0-snapshot right now.
FYI, the reason is probably related to how close the columns are to one
another. If they're about the same distance apart as a space, then the
code has no way of knowing whether it's the next word or the next column.
If it's a lot of space, it should be fine, and if it's somewhere in between,
it may or may not detect it properly. The recent patch should make
this grey area much more reliable.
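To make the gap reasoning concrete, here is a minimal sketch of the kind of decision described above. It is a toy model, not PDFBox's actual code: the class name GapClassifier, the two threshold factors, and the {startX, endX} box layout are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the word-gap vs. column-gap decision. NOT PDFBox's implementation. */
final class GapClassifier {

    enum Gap { SAME_WORD, WORD_BREAK, COLUMN_BREAK }

    // Hypothetical thresholds, expressed as multiples of the width of a space.
    static final double WORD_FACTOR = 0.5;
    static final double COLUMN_FACTOR = 3.0;

    /** Classifies the horizontal gap between two neighbouring text fragments. */
    static Gap classify(double gap, double spaceWidth) {
        if (gap < WORD_FACTOR * spaceWidth) {
            return Gap.SAME_WORD;
        }
        if (gap < COLUMN_FACTOR * spaceWidth) {
            return Gap.WORD_BREAK;
        }
        return Gap.COLUMN_BREAK;
    }

    /**
     * Splits one visual line into column segments.
     * boxes[i] holds {startX, endX} for texts[i], in left-to-right order.
     */
    static List<String> splitLine(String[] texts, double[][] boxes, double spaceWidth) {
        List<String> columns = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < texts.length; i++) {
            if (i > 0) {
                double gap = boxes[i][0] - boxes[i - 1][1];
                Gap kind = classify(gap, spaceWidth);
                if (kind == Gap.COLUMN_BREAK) {
                    // Wide gap: close the current column segment and start a new one.
                    columns.add(current.toString());
                    current.setLength(0);
                } else if (kind == Gap.WORD_BREAK) {
                    current.append(' ');
                }
            }
            current.append(texts[i]);
        }
        if (current.length() > 0) {
            columns.add(current.toString());
        }
        return columns;
    }

    public static void main(String[] args) {
        String[] texts = {"left", "text", "right", "text"};
        // Wide gutter (20 units against a 3-unit space): the line splits in two.
        double[][] wide = {{0, 20}, {23, 40}, {60, 80}, {83, 100}};
        // Narrow gutter (4 units): indistinguishable from a space, so it merges.
        double[][] narrow = {{0, 20}, {23, 40}, {44, 64}, {67, 84}};
        System.out.println(splitLine(texts, wide, 3.0));
        System.out.println(splitLine(texts, narrow, 3.0));
    }
}
```

With the wide gutter the line comes back as two segments, while the narrow gutter yields a single merged segment, which is the line-wise output described in the question. Where exactly the thresholds sit is a tuning choice, and that is the grey area.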
----
Thanks,
Adam