Posted to users@pdfbox.apache.org by Mirko Raner <mi...@raner.ws> on 2011/09/01 02:56:27 UTC
Re: PDF with strange extracted text
These are most likely ligatures in the original PDF. Ligatures for fi, fl,
ffl, and ft are pretty common, and some word processing programs
automatically replace the original character sequences with the
corresponding ligature glyphs. I haven't really seen a Th ligature before,
but it makes sense, because the vertical stems of the T and the h typically
appear too far apart without custom kerning.
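To illustrate the common case, here is a minimal Java sketch, assuming the
extracted text contains the standard Unicode presentation-form ligatures
(U+FB00 to U+FB06, e.g. U+FB01 for fi). The class name LigatureExpander is
made up for this example and is not part of PDFBox. NFKC normalization from
java.text.Normalizer expands those code points into plain letters:

```java
import java.text.Normalizer;

// Hypothetical helper, not a PDFBox class.
class LigatureExpander {

    // NFKC compatibility normalization decomposes the Unicode
    // presentation-form ligatures into their component letters.
    static String expand(String extracted) {
        return Normalizer.normalize(extracted, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // U+FB01 = fi, U+FB00 = ff, U+FB02 = fl
        String extracted = "cruci\uFB01xion, a\uFB00ord, \uFB02ow";
        System.out.println(expand(extracted));
    }
}
```

Note that Unicode defines no code point for an ft or Th ligature, so a font
that contains them has to place them in some custom slot, and this
normalization cannot help there. NFKC also rewrites other compatibility
characters (for example superscript digits), which may or may not be wanted.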
HTH,
Mirko
On Wed, Aug 31, 2011 at 12:59 PM, Hesham G. <he...@gmail.com> wrote:
> Hello,
>
> I extract text from a PDF using PDFBox. Mac's Preview reads the PDF fine,
> but PDFBox reads some words in a strange way.
> Examples:
> crucifixion => cruci<xion
> They => +ey
> after => a>er
>
> You can check a 1-page PDF sample here:
> http://www.4shared.com/document/F5DG_rHu/pdf_with_strange_text.html
>
> Is this a problem with the PDF, or with PDFBox?
>
>
> Best regards,
> Hesham
Re: PDF with strange extracted text
Posted by "Hesham G." <he...@gmail.com>.
Andreas,
Thanks for the explanation.
Best regards,
Hesham
Re: PDF with strange extracted text
Posted by Andreas Lehmkuehler <an...@lehmi.de>.
Hi,
Am 01.09.2011 05:50, schrieb Hesham G.:
> Mirko,
>
> Thanks a lot for your reply.
> Shouldn't PDFBox handle those ligatures automatically, as stated for previous
> PDFBox versions?
Yes, but only if they can be recognized as ligatures. One font in your PDF
uses a custom encoding, and I guess it doesn't provide a mapping to readable
characters. Even Acrobat Reader can't extract those ligatures.
IMHO it's impossible to extract that kind of text without some
pdf2image/OCR approach, which has already been discussed theoretically on
this list.
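For a single known document, one crude workaround is to undo the bogus
mapping after extraction. The following Java sketch assumes the
substitutions implied by the three examples in the original question
('<' for fi, '+' for Th, '>' for ft). The class name and the table are
hypothetical, not part of PDFBox, and the sample may well contain other
affected glyphs:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical post-processing helper, not a PDFBox class.
class CustomGlyphRepair {

    // Assumed table, inferred only from the examples in this thread:
    // cruci<xion -> crucifixion, +ey -> They, a>er -> after.
    static final Map<Character, String> SUBSTITUTIONS = new LinkedHashMap<>();
    static {
        SUBSTITUTIONS.put('<', "fi");
        SUBSTITUTIONS.put('+', "Th");
        SUBSTITUTIONS.put('>', "ft");
    }

    // Replaces every character found in the table with its letter sequence
    // and copies all other characters unchanged.
    static String repair(String extracted) {
        StringBuilder out = new StringBuilder(extracted.length() + 8);
        for (int i = 0; i < extracted.length(); i++) {
            char c = extracted.charAt(i);
            String replacement = SUBSTITUTIONS.get(c);
            if (replacement != null) {
                out.append(replacement);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(repair("cruci<xion"));
        System.out.println(repair("+ey"));
        System.out.println(repair("a>er"));
    }
}
```

Because this replaces every '<', '+', and '>' in the string, it also
corrupts legitimate uses of those characters. Ideally it would be applied
only to text runs that come from the affected font.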
BR
Andreas Lehmkühler
Re: PDF with strange extracted text
Posted by "Hesham G." <he...@gmail.com>.
Mirko,
Thanks a lot for your reply.
Shouldn't PDFBox handle those ligatures automatically, as stated for
previous PDFBox versions?
Best regards,
Hesham