Posted to solr-user@lucene.apache.org by Marcello Lorenzi <ml...@sorint.it> on 2013/11/15 17:16:47 UTC
PDF indexing issues
Hi,
during our testing of Apache Solr 4.3, we noticed some errors
occurring during PDF indexing:
ERROR - 2013-11-15 15:14:26.248;
org.apache.pdfbox.pdmodel.font.PDCIDFont; Error: Could not parse
predefined CMAP file for 'PDFXC30-Indentity0-UCS2'
ERROR - 2013-11-15 15:14:36.108;
org.apache.pdfbox.pdmodel.font.PDCIDFont; Error: Could not parse
predefined CMAP file for '--UCS2'
and
ERROR - 2013-11-15 15:12:18.928; org.apache.pdfbox.filter.FlateFilter;
FlateFilter: stop reading corrupt stream due to a DataFormatException
Could these errors be related to the format of the PDF files?
Thanks,
Marcello
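A minimal sketch for triaging log entries like the ones above, for example to see how many distinct CMAP names fail and how often the FlateFilter error occurs. It assumes each entry sits on a single line in the actual log file, in the `LEVEL - timestamp; logger; message` layout visible in the excerpt (the wrapping above comes from the mail client). The names `LINE_RE`, `summarize` and `sample` are hypothetical and not part of Solr or PDFBox.

```python
import re
from collections import Counter

# Layout assumed from the excerpt: "LEVEL - timestamp; logger; message"
LINE_RE = re.compile(
    r"^(?P<level>[A-Z]+) - "
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}); "
    r"(?P<logger>[\w.$]+); (?P<msg>.*)$"
)


def summarize(lines):
    """Count (logger, message) pairs for ERROR lines logged by PDFBox classes."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m is None:
            continue  # line does not follow the assumed layout
        if m["level"] == "ERROR" and m["logger"].startswith("org.apache.pdfbox."):
            counts[(m["logger"], m["msg"])] += 1
    return counts


# The three entries from the message above, unwrapped to one line each.
sample = [
    "ERROR - 2013-11-15 15:14:26.248; org.apache.pdfbox.pdmodel.font.PDCIDFont; "
    "Error: Could not parse predefined CMAP file for 'PDFXC30-Indentity0-UCS2'",
    "ERROR - 2013-11-15 15:14:36.108; org.apache.pdfbox.pdmodel.font.PDCIDFont; "
    "Error: Could not parse predefined CMAP file for '--UCS2'",
    "ERROR - 2013-11-15 15:12:18.928; org.apache.pdfbox.filter.FlateFilter; "
    "FlateFilter: stop reading corrupt stream due to a DataFormatException",
]

for (logger, msg), n in summarize(sample).items():
    # Print the count, the short class name, and the message.
    print(n, logger.rsplit(".", 1)[-1], msg)
```

To run it against a real log, pass an open file instead of `sample`, e.g. `summarize(open(path))`, where `path` is wherever your Solr instance writes its log.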
Re: PDF indexing issues
Posted by Marcello Lorenzi <ml...@sorint.it>.
Hi,
I have checked the PDFBox Jira issue, but it does not contain a
solution; some users report the same issue with different CMAP
entries. Would it be possible to update the PDFBox library in the Solr
installation?
Thanks,
Marcello
On 11/15/2013 06:27 PM, Furkan KAMACI wrote:
> You should check the Apache PDFBox project. A similar question:
> https://issues.apache.org/jira/browse/PDFBOX-940
Re: PDF indexing issues
Posted by Furkan KAMACI <fu...@gmail.com>.
You should check the Apache PDFBox project. A similar question:
https://issues.apache.org/jira/browse/PDFBOX-940