Posted to users@pdfbox.apache.org by Joel Hirsh <jo...@gmail.com> on 2016/03/29 17:24:43 UTC

Spacing problem with this pdf file

I have a couple of multi-page PDF files that share this problem: on one
page (the first), a few lines come out with extra spaces between almost
every character in the output of PrintTextLocations.

Attached is a snippet from one of those files. The first line has the
problem; the second line does not.

In this file, the first line gets a string that is
0 3- 09               3 ,4 1 1. 6 9        EL E CT R ON I C  D EP O SI T
      F DM S -S E TT L EM E NT    D E PO S IT       37 6 24 9 46 2 99 9

While the second line gets the text without any extra spaces.

The two lines also have different spacing values as reported by
PrintTextLocations.  In the full file, all the good lines share one value
and all the bad lines share a different one.

I cannot see any difference between the lines in Acrobat, when copying and
pasting the text, or when editing the file in Nitro.

This problem shows up in 2.0.0 and the latest 2.0.1 snapshot, as well as in
some older versions I tried (i.e. I don't think it is any kind of
regression).

Thanks

Re: Spacing problem with this pdf file

Posted by Tilman Hausherr <TH...@t-online.de>.
Hi,

If all your files are like that, just discard the spaces and base your
extraction on positions only. There is no guarantee that a PDF contains
spaces between two words anyway.

Tilman
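
One way to read the advice above, as a minimal sketch rather than PDFBox's
own method: drop every blank glyph and decide where spaces go purely from
horizontal gaps. The Glyph class below is a hypothetical stand-in for the
values a PDFTextStripper subclass could read from each TextPosition passed
to writeString (getUnicode(), getXDirAdj(), getWidthDirAdj()). The gap
threshold of 2.5 units is an assumption and would need tuning, for example
against getWidthOfSpace().

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class PositionOnlyLine {

    /** Hypothetical stand-in for the TextPosition fields that matter here. */
    static final class Glyph {
        final String text;   // would come from TextPosition.getUnicode()
        final float x;       // would come from TextPosition.getXDirAdj()
        final float width;   // would come from TextPosition.getWidthDirAdj()

        Glyph(String text, float x, float width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    /**
     * Rebuilds one line from glyph positions only: blank glyphs are discarded,
     * and a single space is emitted wherever the horizontal gap between two
     * visible glyphs exceeds gapThreshold.
     */
    static String rebuild(List<Glyph> glyphs, float gapThreshold) {
        List<Glyph> visible = new ArrayList<>();
        for (Glyph g : glyphs) {
            if (!g.text.trim().isEmpty()) {
                visible.add(g);
            }
        }
        // Order the remaining glyphs left to right (stable sort).
        visible.sort(Comparator.comparingDouble(g -> g.x));

        StringBuilder line = new StringBuilder();
        Glyph previous = null;
        for (Glyph g : visible) {
            // Gap = start of this glyph minus end of the previous one.
            if (previous != null && g.x - (previous.x + previous.width) > gapThreshold) {
                line.append(' ');
            }
            line.append(g.text);
            previous = g;
        }
        return line.toString();
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = new ArrayList<>();
        glyphs.add(new Glyph("0", 10f, 5f));
        glyphs.add(new Glyph(" ", 10f, 7f));   // stray blank on top of the first "0"
        glyphs.add(new Glyph("3", 15f, 5f));
        glyphs.add(new Glyph("0", 40f, 5f));   // real gap before this glyph
        glyphs.add(new Glyph("9", 45f, 5f));
        System.out.println(rebuild(glyphs, 2.5f));
    }
}
```

With the sample glyphs in main, the stray blank that sits on top of the
first "0" is ignored, the real gap becomes one space, and the program
prints 03 09.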


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org


Re: Spacing problem with this pdf file

Posted by Joel Hirsh <jo...@gmail.com>.
I understand, but is there anything I can do in my code to get the string
as shown in ExtractText?

I am subclassing PDFTextStripper, similar to what is done
in PrintTextLocations, and the string coming into writeString(String
string, List<TextPosition> textPositions) is the one where all the spaces
occur.

Thanks


Re: Spacing problem with this pdf file

Posted by Tilman Hausherr <TH...@t-online.de>.
Here's what I got with the ExtractText command-line application:

______
______                                          03-09 3,411.69    
ELECTRONIC DEPOSIT     FDMS-SETTLEMENT  DEPOSIT 376249462999
   03-10          1,645.22     ELECTRONIC DEPOSIT FDMS-SETTLEMENT  
DEPOSIT 376249462999



However I think I understand the cause of your problem, because there's 
output like this:

String[461.20358,340.904 fs=1.0 xscale=1.0 height=4.44 space=4.7999997 width=4.799988]6
String[461.20428,340.904 fs=1.0 xscale=1.0 height=6.48 space=7.2 width=7.200012] 

i.e. a space and a character at (almost) the same place. See this content stream:

BT
   0 0 0 rg
   /F0 1 Tf
   1 0 0 1 29.204 460.096 Tm
   ( ______                                         ) Tj
   1 0 0 1 29.204 451.096 Tm
   ( ______                                         ) Tj
   /F1 1 Tf
   1 0 0 1 29.204 451.096 Tm
   (  03-09          3,411.69     ELECTRONIC DEPOSIT FDMS-SETTLEMENT  DEPOSIT 376249462999                                    ) Tj
   1 0 0 1 29.204 442.096 Tm
   (  03-10          1,645.22     ELECTRONIC DEPOSIT FDMS-SETTLEMENT  DEPOSIT 376249462999                                    ) Tj
ET

Two text lines start at the same position (29.204, 451.096): one made up
mostly of blanks, drawn with font F0, and one with the actual text, drawn
with font F1. That is a bug in the software that created the file.

Tilman
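
The effect of two overlapping runs can be reproduced with a simplified
model. The sketch below is not PDFBox code: it assumes the extractor orders
glyphs by horizontal position, lays out a text run and a run of blanks from
the same start x (29.204), and uses the advances 4.8 and 7.2 taken from the
width values in the output above. The class and method names are made up
for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class OverlapDemo {

    /** One character and the x coordinate where it starts. */
    static final class Glyph {
        final char ch;
        final double x;

        Glyph(char ch, double x) {
            this.ch = ch;
            this.x = x;
        }
    }

    /** Lays out a run of characters from startX with a fixed advance per glyph. */
    static List<Glyph> layOut(String run, double startX, double advance) {
        List<Glyph> out = new ArrayList<>();
        for (int i = 0; i < run.length(); i++) {
            out.add(new Glyph(run.charAt(i), startX + i * advance));
        }
        return out;
    }

    /** Merges both runs, orders all glyphs by x (stable sort), and concatenates them. */
    static String mergeByPosition(List<Glyph> first, List<Glyph> second) {
        List<Glyph> all = new ArrayList<>(first);
        all.addAll(second);
        all.sort(Comparator.comparingDouble(g -> g.x));
        StringBuilder sb = new StringBuilder();
        for (Glyph g : all) {
            sb.append(g.ch);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Glyph> text = layOut("03-09", 29.204, 4.8);   // narrow glyphs of the text run
        List<Glyph> blanks = layOut("    ", 29.204, 7.2);  // wider blanks of the other run
        // Brackets make leading and trailing blanks visible.
        System.out.println("[" + mergeByPosition(text, blanks) + "]");
    }
}
```

The merged string has blanks scattered between the characters of 03-09,
much like the string reported in the original post, and stripping the
blanks recovers the text.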



Re: Spacing problem with this pdf file

Posted by Joel Hirsh <jo...@gmail.com>.
I thought it was attached to the first email, but it is also available at

https://www.dropbox.com/s/btqwaxfsubt3rwx/extra%20spaces.pdf?dl=0



Re: Spacing problem with this pdf file

Posted by Tilman Hausherr <TH...@t-online.de>.
Please upload that file somewhere.

Tilman
