Posted to user@tika.apache.org by Zheng Lin Edwin Yeo <ed...@gmail.com> on 2015/12/18 10:44:05 UTC

Issues with extraction content of PDF files

Hi,

I'm indexing some PDF documents in Solr. However, certain PDF files
contain Chinese text, and after indexing, the indexed content is either
a series of "??????" or empty.

What could be the reason that causes this?

I've shared one of the affected files on Dropbox, which you can access
via the link here:
https://www.dropbox.com/s/rufi9esmnsmzhmw/Desmophen%2B670%2BBAe.pdf?dl=0


Regards,
Edwin

Re: Issues with extraction content of PDF files

Posted by Zheng Lin Edwin Yeo <ed...@gmail.com>.

Hi Tim,

Thanks for your reply and advice.

I've dropped a note to the PDFBox users list too. I'll also post an
update here if I find a solution there.

Regards,
Edwin


On 18 December 2015 at 21:28, Allison, Timothy B. <ta...@mitre.org>
wrote:

> Hi Edwin,
>
>   Thank you for reaching out to Tika.  As I mentioned [0], the issue
> appears to be that the PDF file doesn’t contain Unicode mappings for the
> characters in the document.  This means that PDFBox has no way of
> converting character codes within the PDF into anything useful.  I checked
> with pdftotext, and it also didn’t pull out anything useful.
>
>    I’m not a PDF expert, and you may want to drop a note to the PDFBox
> users list to see if someone there might have a workaround/solution.
>
> Best,
> Tim
>
> [0]
> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201512.mbox/%3CBY2PR09MB11297223E13E266CFB2A5FFC7E00@BY2PR09MB112.namprd09.prod.outlook.com%3E

RE: Issues with extraction content of PDF files

Posted by "Allison, Timothy B." <ta...@mitre.org>.

Hi Edwin,
  Thank you for reaching out to Tika.  As I mentioned [0], the issue appears to be that the PDF file doesn’t contain Unicode mappings for the characters in the document.  This means that PDFBox has no way of converting character codes within the PDF into anything useful.  I checked with pdftotext, and it also didn’t pull out anything useful.
   I’m not a PDF expert, and you may want to drop a note to the PDFBox users list to see if someone there might have a workaround/solution.

Best,
Tim


[0] http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201512.mbox/%3CBY2PR09MB11297223E13E266CFB2A5FFC7E00@BY2PR09MB112.namprd09.prod.outlook.com%3E
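
Since a PDF without Unicode mappings yields either "??????" or empty content, one possible way to catch such files before they reach the index is a simple check on the extracted text. The following is only a rough sketch in Python, not part of Tika, PDFBox, or Solr: the function name `looks_unmapped` and the 0.5 threshold are assumptions chosen for illustration, and the heuristic cannot recover the missing text (that would likely need OCR or a re-exported PDF).

```python
def looks_unmapped(text: str, threshold: float = 0.5) -> bool:
    """Heuristic guess that text extraction failed for lack of Unicode mappings.

    Hypothetical helper: returns True when the extracted text is empty
    (or whitespace only), or when at least `threshold` of its
    non-whitespace characters are placeholders ("?" or U+FFFD).
    """
    # Ignore whitespace so that blank lines do not dilute the ratio.
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return True

    # Count the characters that typically stand in for unmappable glyphs.
    placeholders = sum(1 for ch in visible if ch in "?\ufffd")
    return placeholders / len(visible) >= threshold
```

Documents flagged this way could be logged or routed to a separate queue instead of being indexed with unusable content. The threshold is a guess and would need tuning against real documents, since ordinary text can legitimately contain question marks.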
