Posted to users@pdfbox.apache.org by Gilad Denneboom <gi...@gmail.com> on 2015/08/06 17:49:06 UTC

Major differences between PDFTextStripper and PrintTextLocations

Hi everyone,

I'm looking for advice on a problem I'm encountering where the output of
PDFTextStripper and PrintTextLocations is dramatically different when
processing the same file.
For some reason, the output of PrintTextLocations is 12 times longer than
that of PDFTextStripper, i.e. the entire text is printed out 12 times,
instead of just once.

I'm attaching the file in question, as well as the output produced using
both methods via Google Drive... Hopefully it will come through.

I'd appreciate any ideas as to what might be causing this issue (I'm
guessing there's something wrong with the structure of the file), and of
course any possible solutions.

Thanks in advance, Gilad.

PS. I'm using PDFBox 1.8.10.

output problem.zip
<https://drive.google.com/file/d/0B_eBFHMNjkhseTVaQ0FxSkdmZUE/view?usp=drive_web>

Re: Major differences between PDFTextStripper and PrintTextLocations

Posted by Gilad Denneboom <gi...@gmail.com>.
OK, thanks for looking into it anyway!

On Mon, Aug 10, 2015 at 6:59 PM, Andreas Lehmkuehler <an...@lehmi.de>
wrote:

> On 10.08.2015 at 18:48, Gilad Denneboom wrote:
>
>> I guessed it was something like that... Do you think it's because it was
>> generated with iText?
>>
> Sorry, but I don't know anything about the internals of iText or possible
> bugs of older versions.
>
> BR
> Andreas

Re: Major differences between PDFTextStripper and PrintTextLocations

Posted by Andreas Lehmkuehler <an...@lehmi.de>.
On 10.08.2015 at 18:48, Gilad Denneboom wrote:
> I guessed it was something like that... Do you think it's because it was
> generated with iText?
Sorry, but I don't know anything about the internals of iText or possible bugs
of older versions.

BR
Andreas

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org


Re: Major differences between PDFTextStripper and PrintTextLocations

Posted by Gilad Denneboom <gi...@gmail.com>.
I guessed it was something like that... Do you think it's because it was
generated with iText?

On Mon, Aug 10, 2015 at 6:35 PM, Andreas Lehmkuehler <an...@lehmi.de>
wrote:

> Hi,
>
> On 10.08.2015 at 13:22, Gilad Denneboom wrote:
>
>> Hi Andreas,
>>
>> Of course the output itself is different, but I would expect that the
>> underlying text each tool processes would be the same, and it's not. Have a
>> look at the first line in the PrintTextLocations output file:
>> String[472.89,54.0 fs=10.0 xscale=10.0 height=7.21 space=2.5
>> width=2.7799988]:
>> It is repeated, with exactly the same information, 12 times throughout the
>> output, lines 1, 91, 181, 271, 361, 451, 541, 631, 721, 811, 901 and 991.
>>
>> Why would the same information be processed 12 times in a single run?
>>
> The pdf contains a lot of redundant information, e.g. the header is
> repeated several times (I didn't count them but I guess it's 12 times).
> PDFTextStripper eliminates overlapping text/characters and
> PrintTextLocations doesn't.
>
> BR
> Andreas

Re: Major differences between PDFTextStripper and PrintTextLocations

Posted by Andreas Lehmkuehler <an...@lehmi.de>.
Hi,

On 10.08.2015 at 13:22, Gilad Denneboom wrote:
> Hi Andreas,
>
> Of course the output itself is different, but I would expect that the
> underlying text each tool processes would be the same, and it's not. Have a
> look at the first line in the PrintTextLocations output file:
> String[472.89,54.0 fs=10.0 xscale=10.0 height=7.21 space=2.5
> width=2.7799988]:
> It is repeated, with exactly the same information, 12 times throughout the
> output, lines 1, 91, 181, 271, 361, 451, 541, 631, 721, 811, 901 and 991.
>
> Why would the same information be processed 12 times in a single run?
The PDF contains a lot of redundant information, e.g. the header is repeated
several times (I didn't count them, but I guess it's 12 times). PDFTextStripper
eliminates overlapping text/characters and PrintTextLocations doesn't.
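
To make the difference concrete, here is a minimal, self-contained sketch of
what "eliminating overlapping text" can look like. It is not PDFBox's actual
implementation: the Glyph class, the suppressDuplicates method, the glyph
text "H" and the 0.5 tolerance are all hypothetical, chosen only for
illustration. The coordinates 472.89,54.0 are the ones quoted in this thread.
As far as I know, the real switch for this behaviour in PDFTextStripper is
setSuppressDuplicateOverlappingText.

```java
import java.util.ArrayList;
import java.util.List;

final class OverlapDedupSketch {

    /** Hypothetical stand-in for a positioned glyph; not PDFBox's TextPosition. */
    static final class Glyph {
        final String text;
        final double x;
        final double y;

        Glyph(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Keeps a glyph only if no already-kept glyph has the same text and
     * lies within the given tolerance on both axes.
     */
    static List<Glyph> suppressDuplicates(List<Glyph> glyphs, double tolerance) {
        List<Glyph> kept = new ArrayList<>();
        for (Glyph g : glyphs) {
            boolean duplicate = false;
            for (Glyph k : kept) {
                if (k.text.equals(g.text)
                        && Math.abs(k.x - g.x) < tolerance
                        && Math.abs(k.y - g.y) < tolerance) {
                    duplicate = true;
                    break;
                }
            }
            if (!duplicate) {
                kept.add(g);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = new ArrayList<>();
        // The same glyph drawn 12 times at the same position (assumed scenario).
        for (int i = 0; i < 12; i++) {
            glyphs.add(new Glyph("H", 472.89, 54.0));
        }
        // A different glyph elsewhere on the page.
        glyphs.add(new Glyph("A", 100.0, 700.0));

        System.out.println("raw glyphs:  " + glyphs.size());                          // 13
        System.out.println("after dedup: " + suppressDuplicates(glyphs, 0.5).size()); // 2
    }
}
```

A tool that prints every glyph it receives would report all 13 entries here,
while one that filters overlaps first would report 2, which matches the kind
of 12-fold difference described above.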

BR
Andreas

Re: Major differences between PDFTextStripper and PrintTextLocations

Posted by Gilad Denneboom <gi...@gmail.com>.
Hi Andreas,

Of course the output itself is different, but I would expect that the
underlying text each tool processes would be the same, and it's not. Have a
look at the first line in the PrintTextLocations output file:
String[472.89,54.0 fs=10.0 xscale=10.0 height=7.21 space=2.5 width=2.7799988]:
It is repeated, with exactly the same information, 12 times throughout the
output, lines 1, 91, 181, 271, 361, 451, 541, 631, 721, 811, 901 and 991.

Why would the same information be processed 12 times in a single run?
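
The line numbers listed above are evenly spaced, each one 90 after the
previous, which is what one would see if a 90-line block were emitted 12
times back to back. The following is a small sketch of how such repeats can
be located in an output file. It runs on synthetic placeholder lines, not on
the real attachment, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class RepeatedLineSketch {

    /** Maps each distinct line to the 1-based line numbers where it occurs. */
    static Map<String, List<Integer>> occurrences(List<String> lines) {
        Map<String, List<Integer>> result = new LinkedHashMap<>();
        for (int i = 0; i < lines.size(); i++) {
            result.computeIfAbsent(lines.get(i), k -> new ArrayList<>()).add(i + 1);
        }
        return result;
    }

    public static void main(String[] args) {
        // Synthetic stand-in for the output: a 90-line block repeated 12 times.
        List<String> lines = new ArrayList<>();
        for (int copy = 0; copy < 12; copy++) {
            for (int n = 0; n < 90; n++) {
                lines.add("String[placeholder " + n + "]");
            }
        }

        // Where does the very first line occur again?
        List<Integer> hits = occurrences(lines).get(lines.get(0));
        System.out.println(hits);
        // prints [1, 91, 181, 271, 361, 451, 541, 631, 721, 811, 901, 991]
    }
}
```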

Gilad

On Mon, Aug 10, 2015 at 12:18 PM, Andreas Lehmkühler <an...@lehmi.de>
wrote:

> Hi Gilad,
>
> Sorry for the late answer...
>
> I'm not sure what you're expecting. You are using two totally different
> approaches to process a PDF. PrintTextLocations provides a lot of additional
> information for every piece of text, which may vary from one character up to
> whole words or lines of text. Consequently the output has to be totally
> different and of course much bigger than the output of a simple text
> extraction.
>
> BR
> Andreas

Re: Major differences between PDFTextStripper and PrintTextLocations

Posted by Andreas Lehmkühler <an...@lehmi.de>.
Hi Gilad,

Sorry for the late answer...

I'm not sure what you're expecting. You are using two totally different approaches
to process a PDF. PrintTextLocations provides a lot of additional information
for every piece of text, which may vary from one character up to whole words or
lines of text. Consequently the output has to be totally different and of course
much bigger than the output of a simple text extraction.

BR
Andreas

> On 10 August 2015 at 10:05, Gilad Denneboom <gi...@gmail.com> wrote:
> 
> 
> No one has any ideas?

Re: Major differences between PDFTextStripper and PrintTextLocations

Posted by Gilad Denneboom <gi...@gmail.com>.
No one has any ideas?
