Posted to users@pdfbox.apache.org by Mirko Raner <mi...@raner.ws> on 2011/09/01 02:56:27 UTC

Re: PDF with strange extracted text

These are most likely ligatures in the original PDF. Ligatures for fi, fl,
ffl, and ft are pretty common, and some word processing programs
automatically replace the original character sequences with their
corresponding ligatures. I haven't really seen a Th ligature before, but it
makes sense because the vertical stem of the T and the vertical stem of the h
typically appear visually too far apart without custom kerning.
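
As a rough illustration (this is not something PDFBox does for you in this
form): when the extracted text contains real Unicode ligature code points such
as U+FB01, they can be expanded after extraction. Below is a minimal Java
sketch that uses only java.text.Normalizer; the class name LigatureExpander is
made up for this example. As far as I know, Unicode has no precomposed ft or
Th ligature, so this would not help with glyphs that a font encodes privately.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of PDFBox.
final class LigatureExpander {

    // Expands the Latin ligatures in U+FB00..U+FB06 (ff, fi, fl, ffi, ffl, ...)
    // into plain letters and leaves every other character untouched.
    static String expand(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c >= '\uFB00' && c <= '\uFB06') {
                // NFKC applies the compatibility decomposition of the ligature.
                out.append(Normalizer.normalize(String.valueOf(c), Normalizer.Form.NFKC));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(expand("cruci\uFB01xion")); // prints crucifixion
    }
}
```

Restricting the normalization to that small range is a deliberate choice:
running NFKC over the whole string would also rewrite unrelated compatibility
characters such as superscripts.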

HTH,

Mirko


On Wed, Aug 31, 2011 at 12:59 PM, Hesham G. <he...@gmail.com> wrote:

> Hello,
>
> I have a PDF whose text I extract using PDFBox. The PDF reads fine in
> Mac's Preview, but PDFBox extracts some words in a strange way.
> Examples:
> crucifixion => cruci<xion
> They => +ey
> after => a>er
>
> You can check a one-page PDF sample here:
> http://www.4shared.com/document/F5DG_rHu/pdf_with_strange_text.html
>
> Is this a problem with the PDF, or with PDFBox?
>
>
> Best regards,
> Hesham

Re: PDF with strange extracted text

Posted by "Hesham G." <he...@gmail.com>.
Andreas,

Thanks for the explanation.


Best regards,
Hesham


Re: PDF with strange extracted text

Posted by Andreas Lehmkuehler <an...@lehmi.de>.
Hi,

On 01.09.2011 05:50, Hesham G. wrote:
> Mirko,
>
> Thanks a lot for your reply.
> Shouldn't PDFBox handle those ligatures automatically, as stated in
> previous PDFBox versions?
Yes, but only if they can be recognized as ligatures. One font in your PDF
uses a custom encoding, and I guess it doesn't provide a mapping to readable
characters. Even Acrobat Reader can't extract those ligatures.
IMHO it's impossible to extract that kind of text without some
pdf2image/OCR step, which has already been discussed theoretically on this list.
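
If OCR is not an option and the garbling is consistent, a crude workaround is
a hand-built substitution table applied after extraction. The Java sketch
below is only an illustration under stated assumptions: the three mappings are
taken from the examples reported in this thread, the class name
GlyphSubstitution is hypothetical, and nothing here is a PDFBox API. A blanket
replacement like this also corrupts genuine '<', '+' and '>' characters, so it
should only be applied to text known to come from the affected font.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical post-processing helper, not part of PDFBox.
final class GlyphSubstitution {

    // Assumed table, derived only from the three garbled words in this thread:
    // cruci<xion -> crucifixion, +ey -> They, a>er -> after.
    static final Map<Character, String> TABLE = new LinkedHashMap<>();
    static {
        TABLE.put('<', "fi");
        TABLE.put('+', "Th");
        TABLE.put('>', "ft");
    }

    // Replaces every character that has a table entry; copies the rest as-is.
    static String repair(String extracted) {
        StringBuilder out = new StringBuilder(extracted.length());
        for (char c : extracted.toCharArray()) {
            String replacement = TABLE.get(c);
            out.append(replacement != null ? replacement : String.valueOf(c));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(repair("cruci<xion")); // prints crucifixion
        System.out.println(repair("+ey"));        // prints They
        System.out.println(repair("a>er"));       // prints after
    }
}
```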

BR
Andreas Lehmkühler

Re: PDF with strange extracted text

Posted by "Hesham G." <he...@gmail.com>.
Mirko,

Thanks a lot for your reply.
Shouldn't PDFBox handle those ligatures automatically, as stated in
previous PDFBox versions?


Best regards,
Hesham

