Posted to users@pdfbox.apache.org by John Hewson <jo...@jahewson.com> on 2016/01/01 03:43:17 UTC

Re: Issues with extraction content of PDF files

> On 29 Dec 2015, at 00:34, Zheng Lin Edwin Yeo <ed...@gmail.com> wrote:
> 
> Thanks for your reply Tilman.
> 
> I would like to find out: is the content extraction issue caused by the Identity-H encoding?

Most likely. Identity-H is basically just "no encoding": the 2-byte codes in the content stream are used directly as CIDs. So there needs to be a ToUnicode map in order to extract the text, and this file doesn't have one.
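
To make that concrete, here is a minimal Python sketch of the idea. It does not use PDFBox and it is not how any particular library implements extraction. The function names, the byte string, and the mapping table are all invented for illustration; none of them come from the PDF discussed in this thread.

```python
from typing import Dict, List, Optional

REPLACEMENT = "\ufffd"  # shown when a code has no known Unicode meaning


def split_identity_h(data: bytes) -> List[int]:
    """Split a string into 2-byte big-endian codes.

    With Identity-H each code is used as the CID unchanged, so the
    number by itself says nothing about which character it stands for.
    """
    if len(data) % 2:
        raise ValueError("Identity-H strings use 2 bytes per code")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def extract_text(data: bytes, to_unicode: Optional[Dict[int, str]]) -> str:
    """Translate codes to text using a ToUnicode-style lookup table.

    `to_unicode` is a hypothetical stand-in for a parsed ToUnicode map.
    When it is missing, every code becomes the replacement character.
    """
    codes = split_identity_h(data)
    if to_unicode is None:
        return REPLACEMENT * len(codes)
    return "".join(to_unicode.get(code, REPLACEMENT) for code in codes)


# Invented example data: two codes and a made-up mapping for them.
shown_string = bytes([0x00, 0x2B, 0x00, 0x4C])
hypothetical_map = {0x002B: "H", 0x004C: "i"}

print(extract_text(shown_string, hypothetical_map))  # prints "Hi"
print(extract_text(shown_string, None))              # prints two U+FFFD characters
```

The second call is the situation with this file: the codes are all there, but nothing says which characters they mean.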

-- John

> Regards,
> Edwin
> 
> 
>> On 21 December 2015 at 16:12, Tilman Hausherr <TH...@t-online.de> wrote:
>>> On 21.12.2015 at 04:08, Zheng Lin Edwin Yeo wrote:
>>> Thanks for your reply.
>>> 
>>> I tried Adobe Acrobat Pro DC and it is able to open the file, but when I
>>> open it in Adobe Reader, not all of the text can be extracted properly.
>>> 
>>> Is there any way to check which encoding is used in the PDF files?
>> 
>> Yes, in the font dictionaries, as you can see from this screenshot:
>> 
>> 
>> 
>> However this won't get you the text, obviously.
>> 
>> Tilman
>> 
>>> Regards,
>>> Edwin
>>> 
>>> 
>>> 
>>> 
>>> On 19 December 2015 at 03:07, Tilman Hausherr <TH...@t-online.de> wrote:
>>> 
>>>>> On 18.12.2015 at 18:57, Zheng Lin Edwin Yeo wrote:
>>>>> 
>>>>> I've shared one of the files with the issue on Dropbox, which you can
>>>>> access via this link:
>>>>> https://www.dropbox.com/s/rufi9esmnsmzhmw/Desmophen%2B670%2BBAe.pdf?dl=0
>>>>> 
>>>> Adobe Reader is also unable to extract text.
>>>> 
>> 
>