Posted to users@pdfbox.apache.org by Zheng Lin Edwin Yeo <ed...@gmail.com> on 2015/12/18 18:57:44 UTC

Issues with extraction content of PDF files

Hi,

I'm indexing some PDF documents in Solr. However, certain PDF files contain
Chinese text, and after indexing, the indexed content is either a series of
"??????" or empty.

I've also tried the Tika app, and I get the same results.

What could be causing this?

I've shared one of the files with the issue on Dropbox, which you can access
via the link here:
https://www.dropbox.com/s/rufi9esmnsmzhmw/Desmophen%2B670%2BBAe.pdf?dl=0


Regards,
Edwin

RE: Issues with extraction content of PDF files

Posted by "Allison, Timothy B." <ta...@mitre.org>.

Colleagues,
  To save you the initial diagnosis, at least, here is what I found. From [0]:

>>That said, PDFBox 2.0-RC2 extracts no text and warns: WARNING: No Unicode mapping for CID+71 (71) in font 505Eddc6Arial
>>So, if the file has no Unicode mapping for the font, I doubt they'll be able to fix it.
>>pdftotext is also unable to extract anything useful from the file.

 [0]  http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201512.mbox/%3CBY2PR09MB11297223E13E266CFB2A5FFC7E00@BY2PR09MB112.namprd09.prod.outlook.com%3E
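
To illustrate why the warning above leads to empty output, here is a minimal sketch in Python. It is not PDFBox's code: the function name `extract_identity_h` and the `to_unicode` dictionary are hypothetical, and the warning text only mimics the message quoted above. It assumes the font's string data consists of 2-byte big-endian codes that are used directly as CIDs.

```python
def extract_identity_h(raw, to_unicode, font_name="?"):
    """Decode string bytes shown with a font whose codes are used as CIDs.

    raw        -- bytes of the shown string (assumed 2 bytes per code)
    to_unicode -- hypothetical dict {cid: unicode string}; empty if the
                  font carries no Unicode mapping
    Returns (extracted_text, warnings).
    """
    if len(raw) % 2:
        raise ValueError("expected 2-byte codes")
    text, warnings = [], []
    for i in range(0, len(raw), 2):
        # The 2-byte code is taken as the CID itself.
        cid = int.from_bytes(raw[i:i + 2], "big")
        char = to_unicode.get(cid)
        if char is None:
            # Nothing says which character this glyph stands for.
            warnings.append(
                f"No Unicode mapping for CID+{cid} ({cid}) in font {font_name}"
            )
        else:
            text.append(char)
    return "".join(text), warnings


# With no mapping at all, every code produces a warning and no text.
text, warnings = extract_identity_h(b"\x00\x47\x00\x48", {}, "505Eddc6Arial")
print(repr(text))
print(warnings[0])
```

The glyphs can still be drawn, because rendering only needs the CID, which is why a viewer can display the page while an extractor gets nothing.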


Re: Issues with extraction content of PDF files

Posted by Tilman Hausherr <TH...@t-online.de>.

I don't know enough about that part myself; the best thing would be to read
about it in the PDF specification (ISO 32000-1:2008):

https://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/PDF32000_2008.pdf

Tilman

On 29.12.2015 at 09:34, Zheng Lin Edwin Yeo wrote:
> Thanks for your reply, Tilman.
>
> I would like to find out: is the content extraction issue with this file
> caused by the Identity-H encoding?
>
> Regards,
> Edwin

Re: Issues with extraction content of PDF files

Posted by John Hewson <jo...@jahewson.com>.

> On 29 Dec 2015, at 00:34, Zheng Lin Edwin Yeo <ed...@gmail.com> wrote:
> 
> Thanks for your reply, Tilman.
> 
> I would like to find out: is the content extraction issue with this file caused by the Identity-H encoding?

Most likely. Identity-H is basically just "no encoding" (the character codes are used directly as CIDs), so there needs to be a ToUnicode map in order to extract the text, and this file doesn't have one.

-- John
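
As a rough illustration of what a ToUnicode map provides, the sketch below parses the `bfchar` entries of a small CMap fragment into a code-to-character table. This is a simplified sketch, not a full CMap parser: it ignores `bfrange` sections, the helper name `parse_bfchar` is hypothetical, and the two sample entries are made up for illustration, not taken from the file discussed in this thread.

```python
import re

# "beginbfchar ... endbfchar" sections hold <source code> <UTF-16BE> pairs.
BFCHAR_BLOCK = re.compile(r"beginbfchar(.*?)endbfchar", re.S)
BFCHAR_ENTRY = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")


def parse_bfchar(cmap_text):
    """Return {code: unicode string} for all bfchar entries in cmap_text."""
    mapping = {}
    for block in BFCHAR_BLOCK.findall(cmap_text):
        for src, dst in BFCHAR_ENTRY.findall(block):
            # The destination is a hex string of UTF-16BE code units.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping


# Hypothetical fragment: maps codes 0x0047 and 0x0048 to two CJK characters.
SAMPLE_CMAP = """
2 beginbfchar
<0047> <4E2D>
<0048> <6587>
endbfchar
"""

table = parse_bfchar(SAMPLE_CMAP)
print(table)
```

When a font has no such table and its encoding says nothing about characters, an extractor has no way to turn code 0x0047 back into text.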

Re: Issues with extraction content of PDF files

Posted by Zheng Lin Edwin Yeo <ed...@gmail.com>.

Thanks for your reply, Tilman.

I would like to find out: is the content extraction issue with this file
caused by the Identity-H encoding?

Regards,
Edwin



Re: Issues with extraction content of PDF files

Posted by Tilman Hausherr <TH...@t-online.de>.

On 21.12.2015 at 04:08, Zheng Lin Edwin Yeo wrote:
> Thanks for your reply.
>
> I tried Adobe Acrobat Pro DC, and it is able to open the file, but when the
> file is opened in Adobe Reader, it is not able to extract all the text properly.
>
> Is there any way we can check what type of encoding is used for the
> PDF files?

Yes, in the font dictionaries, as you can see from this screenshot:

[screenshot not included in the plain-text archive]

However, this won't get you the text, obviously.

Tilman
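
For readers without a PDF debugger at hand, the check described above can be roughly approximated by scanning the raw file bytes for `/Encoding` and `/ToUnicode` entries. This is only a heuristic sketch: the function name `font_hints` is hypothetical, it does not parse the PDF structure, and it will miss font dictionaries stored inside compressed object streams. The sample bytes are a made-up font dictionary, not the one from the file in this thread.

```python
import re

# Named encodings such as /Encoding /Identity-H (indirect references like
# "/Encoding 12 0 R" are deliberately not matched by this pattern).
ENCODING = re.compile(rb"/Encoding\s*/([A-Za-z0-9_.\-]+)")
TO_UNICODE = re.compile(rb"/ToUnicode\b")


def font_hints(pdf_bytes):
    """Return the named encodings seen and the number of /ToUnicode keys."""
    encodings = sorted({m.decode("ascii") for m in ENCODING.findall(pdf_bytes)})
    return {
        "encodings": encodings,
        "to_unicode_refs": len(TO_UNICODE.findall(pdf_bytes)),
    }


# Hypothetical uncompressed font dictionary for illustration.
sample = (
    b"<< /Type /Font /Subtype /Type0 /BaseFont /SampleFont "
    b"/Encoding /Identity-H /DescendantFonts [5 0 R] >>"
)
print(font_hints(sample))

# For a real file (path is an assumption):
# with open("document.pdf", "rb") as f:
#     print(font_hints(f.read()))
```

Seeing Identity-H with zero `/ToUnicode` references is a hint that text extraction is likely to fail, although, as noted above, knowing the encoding does not recover the text.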


Re: Issues with extraction content of PDF files

Posted by Zheng Lin Edwin Yeo <ed...@gmail.com>.

Thanks for your reply.

I tried Adobe Acrobat Pro DC, and it is able to open the file, but when the
file is opened in Adobe Reader, it is not able to extract all the text properly.

Is there any way we can check what type of encoding is used for the
PDF files?

Regards,
Edwin

On 19 December 2015 at 03:07, Tilman Hausherr <TH...@t-online.de> wrote:

> On 18.12.2015 at 18:57, Zheng Lin Edwin Yeo wrote:
>
>> I've shared one of the files with the issue on Dropbox, which you can access
>> via the link here:
>> https://www.dropbox.com/s/rufi9esmnsmzhmw/Desmophen%2B670%2BBAe.pdf?dl=0
>>
>
> Adobe Reader is also unable to extract text.

Re: Issues with extraction content of PDF files

Posted by Tilman Hausherr <TH...@t-online.de>.

On 18.12.2015 at 18:57, Zheng Lin Edwin Yeo wrote:
> I've shared one of the files with the issue on Dropbox, which you can access
> via the link here:
> https://www.dropbox.com/s/rufi9esmnsmzhmw/Desmophen%2B670%2BBAe.pdf?dl=0

Adobe Reader is also unable to extract text.



---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org