Posted to users@pdfbox.apache.org by "Hesham G." <he...@gmail.com> on 2010/01/02 08:49:45 UTC

Re: Problem extracting text in Enter chars

Can anyone help me with this issue?

There is another thing I noticed: the phrase "A Worldly" on the same line of the PDF was also extracted without the space, as "AWorldly".
You can check it in this file:
http://www.4shared.com/file/186430363/628fea7f/Enter-sample2.html


Best regards,
Hesham

Re: Problem extracting text in Enter chars

Posted by "Hesham G." <he...@gmail.com>.
>> you might simply insert a space wherever you find two consecutive upper 
>> case letters which are painted in boldface font.
Hmm, wouldn't doing so make extraction much slower? And how can I tell
whether text is painted in bold?

Best regards,
Hesham



Re: Problem extracting text in Enter chars

Posted by Villu Ruusmann <vi...@gmail.com>.
Hello there,

>
> There is another notice ... A phrase "A Worldly" in the same line in the PDF was extracted also as "AWorldly" without space !!
> You can check it in this file :
> http://www.4shared.com/file/186430363/628fea7f/Enter-sample2.html
>

The phrase "A Worldly" occurs in the title section of the article and
is painted using a boldface font.

To my knowledge, PDFBox is not very sophisticated here: it uses the
same word separation detection algorithm for normal, italic, and
boldface fonts alike. However, as this issue demonstrates, it might be
justified to tweak some threshold values etc. in a font-dependent manner.
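
To make the idea of a font-dependent threshold concrete, here is a
minimal sketch in plain Java. It is not PDFBox code: the Glyph class,
the FontAwareWordJoiner class, and both tolerance values are assumptions
made up for illustration. It only shows the principle of comparing the
gap between two characters against a fraction of the space width, and
using a smaller fraction when both characters are bold.

```java
import java.util.List;

// Hypothetical stand-in for one positioned character; not a PDFBox class.
final class Glyph {
    final char ch;
    final float x;          // left edge of the character
    final float width;      // advance width of the character
    final float spaceWidth; // width of a space in this character's font
    final boolean bold;     // assumed to be known already

    Glyph(char ch, float x, float width, float spaceWidth, boolean bold) {
        this.ch = ch;
        this.x = x;
        this.width = width;
        this.spaceWidth = spaceWidth;
        this.bold = bold;
    }
}

final class FontAwareWordJoiner {
    // Assumed tolerances, as a fraction of the space width. Illustrative only.
    static final float NORMAL_TOLERANCE = 0.5f;
    static final float BOLD_TOLERANCE = 0.3f;

    // Joins the characters of one line, emitting a space wherever the
    // horizontal gap exceeds the tolerance chosen for the font style.
    static String join(List<Glyph> line) {
        StringBuilder sb = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : line) {
            if (prev != null) {
                float gap = g.x - (prev.x + prev.width);
                // Use the stricter (smaller) tolerance only when both neighbours are bold.
                float tolerance = (prev.bold && g.bold) ? BOLD_TOLERANCE : NORMAL_TOLERANCE;
                if (gap > tolerance * g.spaceWidth) {
                    sb.append(' ');
                }
            }
            sb.append(g.ch);
            prev = g;
        }
        return sb.toString();
    }
}
```

With these made-up numbers, a gap of 2 units against a space width of 5
is treated as a word break for bold text but not for normal text.
PDFTextStripper also has global setSpacingTolerance and
setAverageCharTolerance setters that may be worth trying first, though
they apply to all fonts equally.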

In the meantime, to work around this particular problem, you might
simply insert a space wherever you find two consecutive upper case
letters that are painted in a boldface font.
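
The workaround above could be sketched roughly as follows. This is plain
Java with no PDFBox calls: StyledChar and BoldCapsSpacer are hypothetical
names, and guessing boldness from a font name that contains "bold" is an
assumption that will not hold for every PDF. In PDFBox itself the font of
each character is reachable through TextPosition.getFont(); how the font
name is then read differs between versions, so that part is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical pairing of a character with the name of the font it was painted in.
final class StyledChar {
    final char ch;
    final String fontName;

    StyledChar(char ch, String fontName) {
        this.ch = ch;
        this.fontName = fontName;
    }
}

final class BoldCapsSpacer {
    // Heuristic (an assumption): treat a font as bold if its name mentions "bold".
    static boolean looksBold(String fontName) {
        return fontName != null
                && fontName.toLowerCase(Locale.ROOT).contains("bold");
    }

    private static boolean isBoldUpper(StyledChar c) {
        return Character.isUpperCase(c.ch) && looksBold(c.fontName);
    }

    // Convenience helper: every character of the text gets the same font name.
    static List<StyledChar> fromString(String text, String fontName) {
        List<StyledChar> out = new ArrayList<>();
        for (char ch : text.toCharArray()) {
            out.add(new StyledChar(ch, fontName));
        }
        return out;
    }

    // Inserts a space between two consecutive upper case letters
    // when both appear to be painted in a bold font.
    static String insertSpaces(List<StyledChar> chars) {
        StringBuilder sb = new StringBuilder();
        StyledChar prev = null;
        for (StyledChar c : chars) {
            if (prev != null && isBoldUpper(prev) && isBoldUpper(c)) {
                sb.append(' ');
            }
            sb.append(c.ch);
            prev = c;
        }
        return sb.toString();
    }
}
```

For example, BoldCapsSpacer.insertSpaces(BoldCapsSpacer.fromString("AWorldly", "Times-Bold"))
gives "A Worldly". The check is one comparison per character, so the
extra cost should be small. Note that the same rule would also split a
bold acronym such as "PDF" into "P D F", so it is only a stopgap.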


VR