Posted to users@pdfbox.apache.org by Augusto Ribeiro Silva <ar...@unsilo.com> on 2016/05/31 13:22:17 UTC

Weird spacing in words

Hi all,

I am using the PDFBox Java library to read the content of some PDFs, and it seems to insert some weird (hyphen-like) spacing inside words. I get the same result using the PDFBox-App command-line utility.

The es tab lish ment of an in te grated Part ner Re la tion ship Man age ment (PRM) sys tem can po ten tially ad dress sev eral as pets

When I extract text from the same PDF using the pdftotext command-line utility, the text comes out correctly:
The establishment of an integrated Partner Relationship Management (PRM) system can potentially address several aspects 

Does anybody have an idea why PDFBox behaves this way, and any tips for fixing it? I am using Tika, but as I understand it, Tika uses PDFBox for PDF processing underneath.

Best regards, 
Augusto
---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org


Re: Weird spacing in words

Posted by Augusto Ribeiro Silva <ar...@unsilo.com>.
Hi

Thanks for the help. I tried that fix out in a snapshot build of my own and it seems to resolve the problem.

I am afraid I can’t help you with the naming :) Both seem fine, but I guess you need to already know what the option does, because it is hard to work out from the name alone.

Best regards,
Augusto

> On 01 Jun 2016, at 17:52, Tilman Hausherr <TH...@t-online.de> wrote:
> 
> Ignore what I wrote yesterday evening. Your content stream excerpt shows that the spaces are already there. Using Adobe Reader shows the same problem. Your file is similar to
> https://issues.apache.org/jira/browse/PDFBOX-3248
> and I just tested the solution I mentioned there, and here's the result:
> 
> ===
> losses equitably and the outcome of the collaboration must be quantifiably
> beneficial to everyone. The objective is to maximise benefits while mini-
> mising costs.
> ===
> 
> What I could do is this: add the logic mentioned in that issue as an option, that is disabled by default. But I won't do it today, because a release is planned. You could use a snapshot, or build yourself.
> 
> Another problem is that I can't come up with a name
> 
> setIgnoreHardSpaces ?
> 
> setFullSpacesHeuristics ?
> 
> Tilman


Re: Weird spacing in words

Posted by Tilman Hausherr <TH...@t-online.de>.
Ignore what I wrote yesterday evening. Your content stream excerpt shows
that the spaces are already there, and Adobe Reader shows the same
problem. Your file is similar to
https://issues.apache.org/jira/browse/PDFBOX-3248
and I just tested the solution I mentioned there, and here's the result:

===
losses equitably and the outcome of the collaboration must be quantifiably
beneficial to everyone. The objective is to maximise benefits while mini-
mising costs.
===

What I could do is this: add the logic mentioned in that issue as an
option that is disabled by default. But I won't do it today, because a
release is planned. You could use a snapshot, or build it yourself.
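
To make the idea of such an option concrete, here is a toy sketch of "ignore the space glyphs in the stream and infer word breaks from positions only". It is not the logic from PDFBOX-3248, which is not reproduced in this thread. The class IgnoreHardSpaces, the Glyph type, the minGap threshold and all coordinates are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class IgnoreHardSpaces {
    // Hypothetical, minimal stand-in for a positioned glyph.
    static final class Glyph {
        final char ch;
        final float x;      // start position on the line
        final float width;  // advance width
        Glyph(char ch, float x, float width) {
            this.ch = ch;
            this.x = x;
            this.width = width;
        }
    }

    // Rebuilds a line: space glyphs from the stream are dropped, and a space
    // is emitted only where the gap between two kept glyphs exceeds minGap.
    static String rebuild(List<Glyph> glyphs, float minGap) {
        StringBuilder out = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : glyphs) {
            if (g.ch == ' ') {
                continue; // ignore the hard space, keep prev unchanged
            }
            if (prev != null && g.x - (prev.x + prev.width) > minGap) {
                out.append(' ');
            }
            out.append(g.ch);
            prev = g;
        }
        return out.toString();
    }

    // Invented layout: "es tab" with a narrow (0.5 unit) space glyph inside
    // the word, then a 3 unit positioning gap with no glyph, then "of".
    static List<Glyph> demoLine() {
        List<Glyph> line = new ArrayList<>();
        float x = 0f;
        for (char c : "es tab".toCharArray()) {
            float w = (c == ' ') ? 0.5f : 5f;
            line.add(new Glyph(c, x, w));
            x += w;
        }
        x += 3f; // word gap produced by positioning only
        for (char c : "of".toCharArray()) {
            line.add(new Glyph(c, x, 5f));
            x += 5f;
        }
        return line;
    }

    public static void main(String[] args) {
        System.out.println(rebuild(demoLine(), 1.5f)); // prints "estab of"
    }
}
```

With these made-up numbers the narrow in-word space falls below the threshold and disappears, while the wider positioning gap still yields a word break. Whether real files separate as cleanly depends on their actual metrics.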

Another problem is that I can't come up with a name:

setIgnoreHardSpaces ?

setFullSpacesHeuristics ?

Tilman




On 01.06.2016 at 13:59, Augusto Ribeiro Silva wrote:
> Hi,
>
> Tweaking the parameters didn't help.
> Here is a part of the pdf in question - https://dl.dropboxusercontent.com/u/2456015/problem.pdf
>
> Best regards,
> Augusto


Re: Weird spacing in words

Posted by Augusto Ribeiro Silva <ar...@unsilo.com>.
Hi,

Tweaking the parameters didn’t help. 
Here is a part of the pdf in question - https://dl.dropboxusercontent.com/u/2456015/problem.pdf

Best regards,
Augusto

> On 31 May 2016, at 22:44, Tilman Hausherr <TH...@t-online.de> wrote:
> 
> Looks like a different problem. Assuming you're using the latest version, you might want to try setting
> 
> PDFTextStripper.setSpacingTolerance()
> 
> the default is 0.5f
> 
> So try some values slightly above or below, i.e. 0.4f, 0.6f, etc.
> 
> another one is
> 
> setAverageCharTolerance()
> 
> the default is 0.3f.
> 
> Tilman


Re: Weird spacing in words

Posted by Tilman Hausherr <TH...@t-online.de>.
Looks like a different problem. Assuming you're using the latest 
version, you might want to try setting

PDFTextStripper.setSpacingTolerance()

The default is 0.5f.

So try some values slightly above or below, e.g. 0.4f or 0.6f.

Another one is

setAverageCharTolerance()

The default is 0.3f.
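
For orientation, here is a rough model of what tolerances like these control. It is a sketch under assumptions, not PDFBox's actual code: the class SpacingModel, the method isWordBreak and all metrics are invented, and the rule (report a word break when the next glyph starts beyond the previous glyph's end plus the smaller of two scaled widths) is a simplification of gap-based space detection.

```java
class SpacingModel {
    // Returns true if the gap before the next glyph is wide enough to be
    // treated as a word break in this simplified model.
    static boolean isWordBreak(float prevEndX, float nextStartX,
                               float widthOfSpace, float averageCharWidth,
                               float spacingTolerance, float averageCharTolerance) {
        // Two candidate thresholds, each scaled by its tolerance.
        float deltaSpace = widthOfSpace * spacingTolerance;
        float deltaCharWidth = averageCharWidth * averageCharTolerance;
        // The smaller threshold wins, so either tolerance can trigger a break.
        float expectedStartOfNextWordX = prevEndX + Math.min(deltaSpace, deltaCharWidth);
        return nextStartX > expectedStartOfNextWordX;
    }

    public static void main(String[] args) {
        float spaceWidth = 2.5f;   // made-up font metric
        float avgCharWidth = 5.0f; // made-up font metric

        // Thresholds with tolerances 0.5f / 0.3f: min(1.25, 1.5) = 1.25
        System.out.println(isWordBreak(10f, 11f, spaceWidth, avgCharWidth, 0.5f, 0.3f)); // false
        System.out.println(isWordBreak(10f, 12f, spaceWidth, avgCharWidth, 0.5f, 0.3f)); // true

        // A lower spacing tolerance (0.2f) shrinks the threshold to 0.5,
        // so the same 1.0 unit gap now counts as a word break.
        System.out.println(isWordBreak(10f, 11f, spaceWidth, avgCharWidth, 0.2f, 0.3f)); // true
    }
}
```

In a model like this, lowering a tolerance makes inferred breaks more likely and raising it makes them less likely. Such thresholds only decide where a space is inferred from a gap; they do not touch space characters that are already part of the strings in the content stream.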

Tilman

On 31.05.2016 at 22:36, Augusto Ribeiro Silva wrote:
> Hi,
>
> PDFDebugger shows the following.
>   (The ) Tj
>    22.7679 0 Td
>    (es t) Tj
>    12.2023 0 Td
>    (ab lis) Tj
>    20.7981 0 Td
>    (h m) Tj
>    14.0054 0 Td
>    (ent ) Tj
>    19.1013 0 Td
>    (of ) Tj
>    14.83369 0 Td
>    (an ) Tj
>    16.0359 0 Td
>    (in te gr) Tj
>    25.72701 0 Td
>    (ate) Tj
>    12.80299 0 Td
>    (d ) Tj
>
> I am not sure if it is the same problem. I will try to get permission to upload the document somewhere tomorrow.
>
> Best regards,
> Augusto


Re: Weird spacing in words

Posted by Augusto Ribeiro Silva <ar...@unsilo.com>.
Hi,

PDFDebugger shows the following:
  (The ) Tj
  22.7679 0 Td
  (es t) Tj
  12.2023 0 Td
  (ab lis) Tj
  20.7981 0 Td
  (h m) Tj
  14.0054 0 Td
  (ent ) Tj
  19.1013 0 Td
  (of ) Tj
  14.83369 0 Td
  (an ) Tj
  16.0359 0 Td
  (in te gr) Tj
  25.72701 0 Td
  (ate) Tj
  12.80299 0 Td
  (d ) Tj

I am not sure if it is the same problem. I will try to get permission to upload the document somewhere tomorrow.
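
As a quick sanity check of the excerpt above, the Tj string operands can be concatenated in order. This is a minimal sketch in plain Java with no PDFBox dependency. The class name TjJoin and the method joinTjOperands are made up for illustration, and the regex deliberately ignores escaped parentheses, hex strings, TJ arrays and font encodings. The Td offsets are skipped on purpose: the point is only to see which characters the strings themselves contain.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TjJoin {
    // Matches a literal string operand followed by the Tj operator, e.g. "(es t) Tj".
    private static final Pattern TJ = Pattern.compile("\\(([^)]*)\\)\\s*Tj");

    // Concatenates the string operands of all Tj operators, in stream order.
    static String joinTjOperands(String contentStream) {
        Matcher m = TJ.matcher(contentStream);
        StringBuilder text = new StringBuilder();
        while (m.find()) {
            text.append(m.group(1));
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // The excerpt from this message, one operator per line.
        String excerpt = String.join("\n",
            "(The ) Tj", "22.7679 0 Td",
            "(es t) Tj", "12.2023 0 Td",
            "(ab lis) Tj", "20.7981 0 Td",
            "(h m) Tj", "14.0054 0 Td",
            "(ent ) Tj", "19.1013 0 Td",
            "(of ) Tj", "14.83369 0 Td",
            "(an ) Tj", "16.0359 0 Td",
            "(in te gr) Tj", "25.72701 0 Td",
            "(ate) Tj", "12.80299 0 Td",
            "(d ) Tj");
        // Brackets make the trailing space visible.
        System.out.println("[" + joinTjOperands(excerpt) + "]");
        // prints: [The es tab lish ment of an in te grated ]
    }
}
```

The joined strings already read "The es tab lish ment of an in te grated ", which suggests the in-word spaces come from the strings in the content stream rather than from any gap heuristic.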

Best regards,
Augusto

> On 31 May 2016, at 18:23, Tilman Hausherr <TH...@t-online.de> wrote:
> 
> Please upload the file somewhere. If you've used PDFDebugger before, have a look here:
> https://issues.apache.org/jira/browse/PDFBOX-3248
> and then look at your content stream whether it is the same problem.
> 
> Tilman


Re: Weird spacing in words

Posted by Tilman Hausherr <TH...@t-online.de>.
Please upload the file somewhere. If you've used PDFDebugger before, 
have a look here:
https://issues.apache.org/jira/browse/PDFBOX-3248
and then check your content stream to see whether it is the same problem.

Tilman

On 31.05.2016 at 15:22, Augusto Ribeiro Silva wrote:
> Hi all,
>
> I am using PDFBox java library to read the content of some PDFs and it seems like it inserts some weird (hyphen-like) spacing. I get the same result using the PDFBox-App command line util.
>
> The es tab lish ment of an in te grated Part ner Re la tion ship Man age ment (PRM) sys tem can po ten tially ad dress sev eral as pets
>
> I tried to extract text from the same PDF using the pdftotext command line utility it extracts the text correctly:
> The establishment of an integrated Partner Relationship Management (PRM) system can potentially address several aspects
>
> Does somebody have any idea why PDFBox behaves in this way and any tips to fixing it? I am using TIKA but as I understood TIKA uses PDFBox for PDF processing underneath.
>
> Best regards,
> Augusto