Posted to users@pdfbox.apache.org by Michael Dick <mb...@tepper.cmu.edu> on 2013/06/17 23:32:06 UTC

Problem with extracting text from pdf using SortByPosition

Hello,

I'm trying to extract text from a PDF (
http://www.oca.state.pa.us/Industry/Electric/elecomp/wpp.pdf), but I'm
having trouble with the way the document is laid out. With the default
setting (SortByPosition false), the last column is not read along with the
rest of its line. Setting SortByPosition to true works better, but it
garbles some of the text (see below).

Is there a setting I can tweak to fix the text when SortByPosition is
true? If not, is there a way to troubleshoot this further?

Thanks so much for any advice!

Michael

For example, on page 4:
*with SortByPosition true*
*TriEWaegslte  PEennenr gPyower *
*1-87P7r-i9c3e EtoA GCLomE p(9a3r3e -2453)*
*www.trieagletehnrerogyu.cgohm *
*FixedA purigcue:s t 6 3 m1o, n2t0h1 t3erm 7.29 ¢ $36.45 $72.90 $145.80*
*$20 per month *
*for each month *
*remaining in the *
*contract term*

*with SortByPosition false*
*TriEagle Energy*
*1-877-93EAGLE (933-2453)*
*www.trieagleenergy.com*
*Fixed price:  6 month term 7.29 ¢ $36.45 $72.90 $145.80*
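
A note on the garbled output above: removing the letters of "TriEagle Energy" from the first garbled line leaves "West Penn Power", so the line reads like two strings interleaved character by character. That would be consistent with two text runs occupying overlapping positions on the page, although this is only a guess about the PDF and has not been confirmed. The Java sketch below is a toy model of that effect. It does not use PDFBox, and the Glyph class, the extract method, and the coordinates are all hypothetical; it only shows how sorting characters by x coordinate can interleave two overlapping runs that read correctly in stream order.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Toy model only: this is not PDFBox code and does not use the PDFBox API.
class PositionSortDemo {

    // One character plus the x coordinate at which it is drawn (hypothetical type).
    static final class Glyph {
        final char ch;
        final double x;

        Glyph(char ch, double x) {
            this.ch = ch;
            this.x = x;
        }
    }

    // Appends one glyph per character, spaced `advance` units apart from startX.
    static void layout(List<Glyph> out, String text, double startX, double advance) {
        for (int i = 0; i < text.length(); i++) {
            out.add(new Glyph(text.charAt(i), startX + i * advance));
        }
    }

    // Places two runs on the same baseline and returns the text either in
    // stream order (first run, then second) or sorted left to right by x.
    static String extract(String first, double firstX,
                          String second, double secondX,
                          double advance, boolean sortByPosition) {
        List<Glyph> glyphs = new ArrayList<>();
        layout(glyphs, first, firstX, advance);
        layout(glyphs, second, secondX, advance);
        if (sortByPosition) {
            glyphs.sort(Comparator.comparingDouble((Glyph g) -> g.x));
        }
        StringBuilder sb = new StringBuilder();
        for (Glyph g : glyphs) {
            sb.append(g.ch);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The second run starts half a character to the right of the first.
        System.out.println(extract("abc", 0.0, "XYZ", 5.0, 10.0, false)); // abcXYZ
        System.out.println(extract("abc", 0.0, "XYZ", 5.0, 10.0, true));  // aXbYcZ
    }
}
```

If the PDF really does contain overlapping runs, no sort setting alone would separate them; the two runs would have to be told apart by some other property, such as the region of the page they belong to.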