Posted to solr-user@lucene.apache.org by Marcello Lorenzi <ml...@sorint.it> on 2013/11/15 17:16:47 UTC

PDF indexing issues

Hi,
during our testing of Apache Solr 4.3, we noticed some errors 
during PDF indexing:

ERROR - 2013-11-15 15:14:26.248; 
org.apache.pdfbox.pdmodel.font.PDCIDFont; Error: Could not parse 
predefined CMAP file for 'PDFXC30-Indentity0-UCS2'
ERROR - 2013-11-15 15:14:36.108; 
org.apache.pdfbox.pdmodel.font.PDCIDFont; Error: Could not parse 
predefined CMAP file for '--UCS2'

and

ERROR - 2013-11-15 15:12:18.928; org.apache.pdfbox.filter.FlateFilter; 
FlateFilter: stop reading corrupt stream due to a DataFormatException

Could these errors be related to the format of the PDF files?

Thanks,
Marcello
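
For anyone triaging a log like the one above, a small script can tally which loggers raise the errors and which CMAP names fail to parse. This is only a minimal sketch in Python, assuming the "LEVEL - timestamp; logger; message" layout seen in the excerpt. The names ENTRY, CMAP_NAME, SAMPLE and summarize are illustrative and are not part of Solr or PDFBox.

```python
import re
from collections import Counter

# One log entry in the "LEVEL - timestamp; logger; message" layout from the
# excerpt above. The message runs until the next entry or the end of the text.
ENTRY = re.compile(
    r"(?P<level>ERROR|WARN|INFO) - (?P<ts>\S+ \S+); (?P<logger>[\w.$]+); "
    r"(?P<msg>.*?)(?= (?:ERROR|WARN|INFO) - \d{4}-|$)"
)
# The CMAP name quoted in the PDCIDFont message.
CMAP_NAME = re.compile(r"predefined CMAP file for '([^']*)'")

# The three entries from the report, including the email line wrapping.
SAMPLE = """
ERROR - 2013-11-15 15:14:26.248;
org.apache.pdfbox.pdmodel.font.PDCIDFont; Error: Could not parse
predefined CMAP file for 'PDFXC30-Indentity0-UCS2'
ERROR - 2013-11-15 15:14:36.108;
org.apache.pdfbox.pdmodel.font.PDCIDFont; Error: Could not parse
predefined CMAP file for '--UCS2'
ERROR - 2013-11-15 15:12:18.928; org.apache.pdfbox.filter.FlateFilter;
FlateFilter: stop reading corrupt stream due to a DataFormatException
"""


def summarize(log_text):
    """Return (entries per logger class, list of CMAP names that failed)."""
    # Collapse all whitespace so wrapped entries become one line of text.
    flat = " ".join(log_text.split())
    by_logger = Counter()
    cmap_names = []
    for match in ENTRY.finditer(flat):
        # Keep only the class name, e.g. "PDCIDFont".
        by_logger[match.group("logger").rsplit(".", 1)[-1]] += 1
        cmap_names.extend(CMAP_NAME.findall(match.group("msg")))
    return by_logger, cmap_names
```

Calling summarize(SAMPLE) groups the three entries into two PDCIDFont errors and one FlateFilter error, which makes it easier to see whether one font encoding or many different ones are involved.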

Re: PDF indexing issues

Posted by Marcello Lorenzi <ml...@sorint.it>.
Hi,
I have checked the PDFBox Jira issue, but it does not contain a 
solution: other users have experienced the same error with different 
CMAP entries. Would it be possible to update the PDFBox library in the 
Solr installation?

Thanks,
Marcello
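
Before replacing any jars, it may help to confirm which PDFBox and related versions the installation actually bundles. The following is a rough sketch in Python, assuming the jars follow the usual artifact-version.jar naming. The function names and the list of artifacts are illustrative, and the directory to scan depends on your installation (in Solr 4.x distributions the extraction jars are commonly under contrib/extraction/lib, but treat that location as an assumption).

```python
import re
from pathlib import Path

# Artifacts of interest for PDF extraction; this list is an assumption and
# can be extended. The version is whatever follows the artifact name.
JAR_NAME = re.compile(r"^(pdfbox|fontbox|jempbox|tika-[a-z]+)-(\d[\w.]*)\.jar$")


def parse_jar_name(file_name):
    """Return (artifact, version) for a matching jar file name, else None."""
    match = JAR_NAME.match(file_name)
    return (match.group(1), match.group(2)) if match else None


def bundled_versions(root):
    """Scan a directory tree and map each matching artifact to its versions."""
    found = {}
    for jar in Path(root).rglob("*.jar"):
        parsed = parse_jar_name(jar.name)
        if parsed:
            artifact, version = parsed
            # A set exposes duplicate copies with conflicting versions.
            found.setdefault(artifact, set()).add(version)
    return found
```

For example, bundled_versions("/path/to/solr") (a hypothetical path) returns a dictionary of artifact names to the version strings found. More than one version for the same artifact would suggest conflicting copies on the classpath.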

On 11/15/2013 06:27 PM, Furkan KAMACI wrote:
> You should check the Apache PDFBox project. A similar question:
> https://issues.apache.org/jira/browse/PDFBOX-940


Re: PDF indexing issues

Posted by Furkan KAMACI <fu...@gmail.com>.
You should check the Apache PDFBox project. A similar question:
https://issues.apache.org/jira/browse/PDFBOX-940

