Posted to users@pdfbox.apache.org by John Hewson <jo...@jahewson.com> on 2016/01/01 03:43:17 UTC
Re: Issues with extraction content of PDF files
> On 29 Dec 2015, at 00:34, Zheng Lin Edwin Yeo <ed...@gmail.com> wrote:
>
> Thanks for your reply Tilman.
>
> I would like to find out: is the content extraction issue with this file caused by the Identity-H encoding?
Most likely. Identity-H is basically just "no encoding", so there needs to be a ToUnicode map in order to extract the text (which there isn't).
-- John
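
To illustrate the point above, here is a minimal sketch in plain Java (not PDFBox code) of why Identity-H alone carries no character information. With Identity-H, each two-byte code in a content-stream string is used directly as the CID, so text can only be recovered if a separate code-to-Unicode table exists. The class name `IdentityHSketch`, the methods `toCids` and `extract`, and the sample byte values and map contents are all hypothetical, chosen only for illustration; a real ToUnicode entry is a CMap stream in the font dictionary, which is modelled here as a simple `Map`.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; this is not how PDFBox is implemented.
final class IdentityHSketch {

    // Identity-H: read the string two bytes at a time (big-endian) and use
    // each code unchanged as the CID. No character meaning is attached.
    static List<Integer> toCids(byte[] stringBytes) {
        List<Integer> cids = new ArrayList<>();
        for (int i = 0; i + 1 < stringBytes.length; i += 2) {
            int code = ((stringBytes[i] & 0xFF) << 8) | (stringBytes[i + 1] & 0xFF);
            cids.add(code);
        }
        return cids;
    }

    // Text extraction needs a code-to-Unicode table. An empty map models a
    // font without a ToUnicode entry: every code becomes U+FFFD.
    static String extract(byte[] stringBytes, Map<Integer, String> toUnicode) {
        StringBuilder sb = new StringBuilder();
        for (int cid : toCids(stringBytes)) {
            sb.append(toUnicode.getOrDefault(cid, "\uFFFD"));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up string bytes: the two codes 0x002B and 0x004C.
        byte[] shown = {0x00, 0x2B, 0x00, 0x4C};

        // Made-up mapping standing in for a ToUnicode CMap.
        Map<Integer, String> toUnicode = new HashMap<>();
        toUnicode.put(0x002B, "H");
        toUnicode.put(0x004C, "i");

        System.out.println(extract(shown, toUnicode));        // mapping present
        System.out.println(extract(shown, new HashMap<>()));  // mapping absent
    }
}
```

The same bytes render correctly on screen in both cases, because drawing only needs the glyph; it is the extraction step that depends on the mapping.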
> Regards,
> Edwin
>
>
>> On 21 December 2015 at 16:12, Tilman Hausherr <TH...@t-online.de> wrote:
>>> On 21.12.2015 at 04:08, Zheng Lin Edwin Yeo wrote:
>>> Thanks for your reply.
>>>
>>> I tried Adobe Acrobat Pro DC and it is able to open the file, but Adobe
>>> Reader is not able to extract all of the text properly.
>>>
>>> Is there any way we can check what type of encoding is used for the
>>> PDF files?
>>
>> Yes, in the font dictionaries, as you can see from this screenshot:
>>
>>
>>
>> However this won't get you the text, obviously.
>>
>> Tilman
>>
>>> Regards,
>>> Edwin
>>>
>>>
>>>
>>>
>>> On 19 December 2015 at 03:07, Tilman Hausherr <TH...@t-online.de> wrote:
>>>
>>>>> On 18.12.2015 at 18:57, Zheng Lin Edwin Yeo wrote:
>>>>>
>>>>> I've shared one of the files with the issue on Dropbox, which you can
>>>>> access via the link here:
>>>>> https://www.dropbox.com/s/rufi9esmnsmzhmw/Desmophen%2B670%2BBAe.pdf?dl=0
>>>>>
>>>> Adobe Reader is also unable to extract text.
>>>>
>>>>
>>>>
>>>>
>>
>