Posted to dev@lucene.apache.org by "Christoph Straßer (JIRA)" <ji...@apache.org> on 2013/08/08 11:14:47 UTC

[jira] [Created] (SOLR-5124) Solr glues words when parsing PDFs under certain circumstances

Christoph Straßer created SOLR-5124:
---------------------------------------

Summary: Solr glues words when parsing PDFs under certain circumstances
Key: SOLR-5124
URL: https://issues.apache.org/jira/browse/SOLR-5124
Project: Solr
Issue Type: Bug
Components: update
Affects Versions: 4.4
Environment: Windows 7 (I don't think this is relevant)
Reporter: Christoph Straßer
Priority: Minor

For some kinds of PDF documents, Solr glues words together at line breaks under some circumstances (e.g. the last word of line 1 and the first word of line 2 are merged into one word).
Stand-alone Tika extracts the text correctly. Attached you will find one sample PDF and screenshots of the Tika output and of the corrupted content indexed by Solr.
This issue does not occur with all PDF documents. I tried to recreate it with new Word documents that I converted to PDF in multiple ways, without success. The attached PDF document has a really weird internal structure, but Tika seems to do its work right even with this weird document.
Our Solr indices contain a good number of these weird documents, which results in worse suggestions from the Suggester.
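
One way to check other documents for the same symptom is to compare the text extracted by stand-alone Tika with the content stored in Solr and look for indexed tokens that equal two neighbouring Tika tokens joined without a separator. The following is only a rough sketch of that comparison: the function name `find_glued_words` and the sample strings are hypothetical, and it assumes both texts have already been obtained as plain strings. It says nothing about where in Solr the merge actually happens.

```python
import re


def find_glued_words(reference_text, indexed_text):
    """Return indexed tokens that look like two adjacent reference tokens merged.

    reference_text: text as extracted by stand-alone Tika (assumed correct).
    indexed_text:   content as stored in the Solr index (possibly corrupted).
    """
    ref_tokens = re.findall(r"\w+", reference_text)
    ref_vocab = set(ref_tokens)
    # Every pair of neighbouring reference tokens, joined without a separator.
    merged_pairs = {a + b: (a, b) for a, b in zip(ref_tokens, ref_tokens[1:])}

    glued = []
    for token in re.findall(r"\w+", indexed_text):
        # Only report tokens that are not ordinary words of the reference text.
        if token not in ref_vocab and token in merged_pairs:
            glued.append((token, merged_pairs[token]))
    return glued


# Hypothetical sample data: "brown" ends line 1, "fox" starts line 2.
tika_text = "the quick brown\nfox jumps over"
solr_text = "the quick brownfox jumps over"
print(find_glued_words(tika_text, solr_text))
```

Running such a check over a sample of indexed documents would give a rough idea of how many of them are affected before any fix is attempted.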
