Posted to users@pdfbox.apache.org by "Hesham G." <he...@gmail.com> on 2010/01/02 08:49:45 UTC
Re: Problem extracting text in Enter chars
Can anyone help me with this issue?
There is one more thing I noticed: the phrase "A Worldly", on the same line of the PDF, was also extracted as "AWorldly", without the space.
You can check it in this file:
http://www.4shared.com/file/186430363/628fea7f/Enter-sample2.html
Best regards,
Hesham
Re: Problem extracting text in Enter chars
Posted by "Hesham G." <he...@gmail.com>.
>> you might simply insert a space wherever you find two consecutive upper
>> case letters which are painted in boldface font.
Hmm, doing so will make extraction much slower. Also, how can I tell whether
a piece of text is painted in bold?
Best regards,
Hesham
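
One possible way to approach the "is it bold?" question is to look at the name of the font a character was painted with. The sketch below is only a heuristic and assumes you can obtain that name as a plain string; in recent PDFBox versions this is, as far as I know, available per character through TextPosition.getFont() and the font's getName(), but treat that as an assumption and check it against the version you use. The class name BoldFontNames and the method looksBold are hypothetical helpers, not part of PDFBox.

```java
import java.util.Locale;

// Hypothetical helper, not part of PDFBox: guesses boldness from a font name.
final class BoldFontNames {
    private BoldFontNames() {}

    /**
     * Returns true when the font name suggests a bold face,
     * e.g. "Times-Bold" or "ABCDEF+Arial,BoldItalic".
     */
    static boolean looksBold(String fontName) {
        if (fontName == null) {
            return false;
        }
        String name = fontName;
        // Subset fonts carry a six-letter tag followed by '+', e.g. "ABCDEF+".
        // Drop it so that a random tag such as "BOLDAB+" cannot cause a match.
        if (name.indexOf('+') == 6) {
            name = name.substring(7);
        }
        return name.toLowerCase(Locale.ROOT).contains("bold");
    }
}
```

A name check like this is cheap, so calling it once per font rather than once per character should keep the slowdown small. It will miss bold fonts whose names do not contain "Bold".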
--------------------------------------
> Hello there,
>
>>
>> There is another notice ... A phrase "A Worldly" in the same line in the
>> PDF was extracted also as "AWorldly" without space !!
>> You can check it in this file :
>> http://www.4shared.com/file/186430363/628fea7f/Enter-sample2.html
>>
>
> The phrase "A Worldly" occurs in the title section of the article and
> is painted using a boldface font.
>
> To my knowledge, PDFBox is not very sophisticated here: it uses the
> same word separation detection algorithm for all fonts, whether
> normal, italic, or boldface. However, as this issue demonstrates, it
> might be justified to tweak some threshold values in a font-dependent
> manner.
>
> In the meantime, to overcome this particular problem, you might
> simply insert a space wherever you find two consecutive upper case
> letters which are painted in boldface font.
>
>
> VR
>
Re: Problem extracting text in Enter chars
Posted by Villu Ruusmann <vi...@gmail.com>.
Hello there,
>
> There is another notice ... A phrase "A Worldly" in the same line in the PDF was extracted also as "AWorldly" without space !!
> You can check it in this file :
> http://www.4shared.com/file/186430363/628fea7f/Enter-sample2.html
>
The phrase "A Worldly" occurs in the title section of the article and
is painted using a boldface font.
To my knowledge, PDFBox is not very sophisticated here: it uses the
same word separation detection algorithm for all fonts, whether
normal, italic, or boldface. However, as this issue demonstrates, it
might be justified to tweak some threshold values in a font-dependent
manner.
In the meantime, to overcome this particular problem, you might
simply insert a space wherever you find two consecutive upper case
letters which are painted in boldface font.
VR
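
The workaround described above could be sketched roughly as follows. This is a minimal illustration in plain Java, not PDFBox code: it assumes the extracted text is available as a string together with a per-character flag saying whether that character was painted in a bold font. The class BoldCapsSpacer, the method insertSpaces, and the boolean array are all hypothetical names introduced for the example; how the flags are obtained from the PDF is left open.

```java
// Hypothetical post-processing helper, not part of PDFBox.
final class BoldCapsSpacer {
    private BoldCapsSpacer() {}

    /**
     * Inserts a space between two adjacent upper case letters
     * when both of them are flagged as bold.
     *
     * @param text extracted text
     * @param bold bold[i] is true when text.charAt(i) was painted in a bold font
     */
    static String insertSpaces(String text, boolean[] bold) {
        if (text.length() != bold.length) {
            throw new IllegalArgumentException("need one bold flag per character");
        }
        StringBuilder out = new StringBuilder(text.length() + 8);
        for (int i = 0; i < text.length(); i++) {
            char current = text.charAt(i);
            if (i > 0) {
                char previous = text.charAt(i - 1);
                boolean bothBold = bold[i - 1] && bold[i];
                boolean bothUpper = Character.isUpperCase(previous)
                        && Character.isUpperCase(current);
                if (bothBold && bothUpper) {
                    out.append(' ');
                }
            }
            out.append(current);
        }
        return out.toString();
    }
}
```

With every character flagged as bold, insertSpaces("AWorldly", flags) returns "A Worldly". Note that the rule is blunt: a bold acronym such as "PDF" would come out as "P D F", so it only suits documents where bold runs of capitals are not expected.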