Posted to users@pdfbox.apache.org by Joel Hirsh <jo...@gmail.com> on 2016/03/29 17:24:43 UTC
Spacing problem with this pdf file
I have a couple of PDF files that have this problem. These are multi-page
PDF files, and on one page (the first) a few lines get extra spaces
between almost every character, as seen from PrintTextLocations.
Attached is a snippet from one of those files; the first line has the
problem, the second line does not.
In this file, the first line gets a string that is
0 3- 09 3 ,4 1 1. 6 9 EL E CT R ON I C D EP O SI T
F DM S -S E TT L EM E NT D E PO S IT 37 6 24 9 46 2 99 9
While the second line gets the text without any extra spaces.
The two lines also have different spacing values as reported by
PrintTextLocations. In the full file, all the good lines have one value,
the bad lines a different value.
I cannot see any difference between the lines when viewing them in Acrobat,
when copying and pasting the text, or when editing in Nitro.
This problem shows up in 2.0.0 and the latest 2.0.1 snapshot, and in some
older versions I tried as well (i.e. I don't think it is any kind of
regression).
Thanks
Re: Spacing problem with this pdf file
Posted by Tilman Hausherr <TH...@t-online.de>.
Hi,
If all your files are like that, just drop the space characters and base
your extraction on positions only. There is no guarantee that a PDF
contains spaces between two words anyway.
Tilman
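A minimal sketch of the position-only approach suggested above, in plain Java without a PDFBox dependency. The Glyph class is a hypothetical stand-in for the values a PDFBox TextPosition exposes through getUnicode(), getX() and getWidth(); in a PDFTextStripper subclass those values would presumably be read from the textPositions argument of writeString. The gapThreshold parameter is an assumed tuning value, not something PDFBox defines.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for the values a PDFBox TextPosition exposes
// through getUnicode(), getX() and getWidth().
final class Glyph {
    final String text;
    final float x;
    final float width;

    Glyph(String text, float x, float width) {
        this.text = text;
        this.x = x;
        this.width = width;
    }
}

final class PositionOnlyLine {
    // Rebuilds one line of text from glyph positions alone.
    // gapThreshold is an assumed tuning value: a horizontal gap wider than
    // this (in the same units as x) is treated as a word break.
    static String rebuild(List<Glyph> glyphs, float gapThreshold) {
        List<Glyph> visible = new ArrayList<>();
        for (Glyph g : glyphs) {
            // Ignore every space glyph stored in the file.
            if (!g.text.trim().isEmpty()) {
                visible.add(g);
            }
        }
        // Order the remaining glyphs from left to right.
        visible.sort(Comparator.comparingDouble(g -> g.x));

        StringBuilder line = new StringBuilder();
        float previousEnd = Float.NaN;
        for (Glyph g : visible) {
            // Insert a space only where the layout shows a real gap.
            if (!Float.isNaN(previousEnd) && g.x - previousEnd > gapThreshold) {
                line.append(' ');
            }
            line.append(g.text);
            previousEnd = g.x + g.width;
        }
        return line.toString();
    }
}
```

With glyphs "A" (x=0, width 5), a stray space at x=0.001, "B" (x=5, width 5) and "C" (x=15, width 5), calling PositionOnlyLine.rebuild(glyphs, 2.5f) gives "AB C": the stored space is ignored and the only space comes from the 5-unit gap before "C". A threshold derived from the font's space width may work better than a fixed number, but that is left open here.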
On 29.03.2016 at 19:36, Joel Hirsh wrote:
> I understand, but is there anything I can do in my code to get the string
> as shown in ExtractText?
>
> I am subclassing PDFTextStripper, similar to what is done
> in PrintTextLocations, and the string coming into writeString(String
> string, List<TextPosition> textPositions) is the one where all the spaces
> occur.
>
> Thanks
---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org
Re: Spacing problem with this pdf file
Posted by Joel Hirsh <jo...@gmail.com>.
I understand, but is there anything I can do in my code to get the string
as shown in ExtractText?
I am subclassing PDFTextStripper, similar to what is done
in PrintTextLocations, and the string coming into writeString(String
string, List<TextPosition> textPositions) is the one where all the spaces
occur.
Thanks
On Tue, Mar 29, 2016 at 10:03 AM, Tilman Hausherr <TH...@t-online.de>
wrote:
> Here's what I got with ExtractText command line application:
>
> ______
> ______ 03-09 3,411.69
> ELECTRONIC DEPOSIT FDMS-SETTLEMENT DEPOSIT 376249462999
> 03-10 1,645.22 ELECTRONIC DEPOSIT FDMS-SETTLEMENT DEPOSIT
> 376249462999
>
>
>
> However I think I understand the cause of your problem, because there's
> output like this:
>
> String[461.20358,340.904 fs=1.0 xscale=1.0 height=4.44 space=4.7999997
> width=4.799988]6
> String[461.20428,340.904 fs=1.0 xscale=1.0 height=6.48 space=7.2
> width=7.200012]
>
> i.e. space and a character at the same place. See this content stream:
>
> BT
> 0 0 0 rg
> /F0 1 Tf
> 1 0 0 1 29.204 460.096 Tm
> ( ______ ) Tj
> 1 0 0 1 29.204 451.096 Tm
> ( ______ ) Tj
> /F1 1 Tf
> 1 0 0 1 29.204 451.096 Tm
> ( 03-09 3,411.69 ELECTRONIC DEPOSIT FDMS-SETTLEMENT
> DEPOSIT 376249462999 ) Tj
> 1 0 0 1 29.204 442.096 Tm
> ( 03-10 1,645.22 ELECTRONIC DEPOSIT FDMS-SETTLEMENT
> DEPOSIT 376249462999 ) Tj
> ET
>
> There are two lines that start at the same position 29.204 451.096, one
> with blanks, one with a text. That is a bug by the creator of the file.
>
> Tilman
Re: Spacing problem with this pdf file
Posted by Tilman Hausherr <TH...@t-online.de>.
Here's what I got with the ExtractText command-line application:
______
______ 03-09 3,411.69
ELECTRONIC DEPOSIT FDMS-SETTLEMENT DEPOSIT 376249462999
03-10 1,645.22 ELECTRONIC DEPOSIT FDMS-SETTLEMENT
DEPOSIT 376249462999
However, I think I understand the cause of your problem, because there's
output like this:
String[461.20358,340.904 fs=1.0 xscale=1.0 height=4.44 space=4.7999997 width=4.799988]6
String[461.20428,340.904 fs=1.0 xscale=1.0 height=6.48 space=7.2 width=7.200012]
i.e. space and a character at the same place. See this content stream:
BT
0 0 0 rg
/F0 1 Tf
1 0 0 1 29.204 460.096 Tm
( ______ ) Tj
1 0 0 1 29.204 451.096 Tm
( ______ ) Tj
/F1 1 Tf
1 0 0 1 29.204 451.096 Tm
( 03-09 3,411.69 ELECTRONIC DEPOSIT FDMS-SETTLEMENT DEPOSIT 376249462999 ) Tj
1 0 0 1 29.204 442.096 Tm
( 03-10 1,645.22 ELECTRONIC DEPOSIT FDMS-SETTLEMENT DEPOSIT 376249462999 ) Tj
ET
There are two lines that start at the same position, 29.204 451.096: one
containing only blanks, one containing text. That is a bug in the software
that created the file.
Tilman
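One possible way to act on this diagnosis is to discard any space glyph that sits at (almost) the same position as a non-space glyph before building the output string. The sketch below is plain Java under stated assumptions: PlacedGlyph is a hypothetical stand-in for the text and x/y values that PrintTextLocations prints, and the tolerance is an assumed value that would need tuning against real files.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the text and position that
// PrintTextLocations prints for each glyph.
final class PlacedGlyph {
    final String text;
    final float x;
    final float y;

    PlacedGlyph(String text, float x, float y) {
        this.text = text;
        this.x = x;
        this.y = y;
    }
}

final class OverlappingSpaceFilter {
    // Returns the glyphs in their original order, minus every blank glyph
    // that lies within `tolerance` of a non-blank glyph on both axes.
    static List<PlacedGlyph> dropOverlappingSpaces(List<PlacedGlyph> glyphs,
                                                   float tolerance) {
        List<PlacedGlyph> kept = new ArrayList<>();
        for (PlacedGlyph candidate : glyphs) {
            if (isBlank(candidate)
                    && sharesPositionWithText(candidate, glyphs, tolerance)) {
                continue; // blank drawn on top of a real character
            }
            kept.add(candidate);
        }
        return kept;
    }

    private static boolean isBlank(PlacedGlyph g) {
        return g.text.trim().isEmpty();
    }

    private static boolean sharesPositionWithText(PlacedGlyph blank,
                                                  List<PlacedGlyph> glyphs,
                                                  float tolerance) {
        for (PlacedGlyph other : glyphs) {
            if (!isBlank(other)
                    && Math.abs(other.x - blank.x) <= tolerance
                    && Math.abs(other.y - blank.y) <= tolerance) {
                return true;
            }
        }
        return false;
    }
}
```

Using the two positions from the output above, a "6" at 461.20358,340.904 and a blank at 461.20428,340.904, a tolerance of 0.01 removes the blank, while a blank that stands alone elsewhere on the line is kept. The scan is quadratic in the number of glyphs, which should be acceptable for a single line of text.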
On 29.03.2016 at 18:48, Joel Hirsh wrote:
> I thought it was attached to the first email, but it is also available at
>
> https://www.dropbox.com/s/btqwaxfsubt3rwx/extra%20spaces.pdf?dl=0
Re: Spacing problem with this pdf file
Posted by Joel Hirsh <jo...@gmail.com>.
I thought it was attached to the first email, but it is also available at
https://www.dropbox.com/s/btqwaxfsubt3rwx/extra%20spaces.pdf?dl=0
On Tue, Mar 29, 2016 at 9:13 AM, Tilman Hausherr <TH...@t-online.de>
wrote:
> Please upload that file somewhere.
>
> Tilman
Re: Spacing problem with this pdf file
Posted by Tilman Hausherr <TH...@t-online.de>.
Please upload that file somewhere.
Tilman
On 29.03.2016 at 17:24, Joel Hirsh wrote:
> I have a couple of PDF files that have this problem. These are
> multi-page PDF files, and on one page (the first) there are a few
> lines that get extra spaces between almost every character as seen
> from PrintTextLocations.
>
> Attached is a snippet from one of those files, the first line has the
> problem, the second line does not.
>
> In this file, the first line gets a string that is
> 0 3- 09 3 ,4 1 1. 6 9 EL E CT R ON I C D EP O SI
> T F DM S -S E TT L EM E NT D E PO S IT 37 6 24 9 46 2
> 99 9
>
> While the second line gets the text without any extra spaces.
>
> The two lines also have different spacing values as reported by
> PrintTextLocations. In the full file, all the good lines have one
> value, the bad lines a different value.
>
> I cannot see any difference between the lines in Acrobat, doing
> copy/paste, Nitro editing.
>
> This problem shows up in 2.0.0 and the latest 2.0.1 snapshot, and some
> older versions I tried as well (i.e. I don't think it is any kind of
> regression)
>
> Thanks