Posted to solr-user@lucene.apache.org by Alan G Quan <al...@aero.org> on 2016/04/14 00:49:02 UTC

Problem retaining PDF text

I am indexing PDF documents in Solr 5.3.0 like this:
curl "http://localhost:8983/solr/mycore1/update/extract?literal.id=101&commit=true" -F "myfile=@101.pdf"
This works fine: I can search for keywords from the PDF text and Solr finds the document. But when I later change the Solr record for that document with an atomic update ("set" or "add"), the PDF text is lost. I verified this by searching for the same keyword after the update; the document is no longer found. The record with the literal id field value "101" is still there after the update, but the text is gone. Why does Solr drop the PDF text after any update to the document's record, and is there a way to prevent that?

Regards,
Alan

Re: Problem retaining PDF text

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
An atomic update has to reload the content of all the _other_ fields
to reconstruct the full document before putting it back into the
Lucene index. That's because Lucene does not support in-place updates;
every update actually deletes the original document and recreates it.
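
A toy model of that delete-and-recreate cycle, in Python rather than
actual Solr or Lucene code. The names ToyIndex, add, atomic_set and
search, and the field names title and content, are all made up for
illustration; only the behaviour (rebuild from stored values, then
replace the document) is meant to mirror the description above.

```python
class ToyIndex:
    """Toy stand-in for an index; not Solr or Lucene code."""

    def __init__(self, stored_fields):
        self.stored_fields = set(stored_fields)  # fields whose original value is kept
        self.stored = {}    # doc id -> {field: original value}
        self.postings = {}  # doc id -> {field: set of searchable terms}

    def add(self, doc):
        """Index every field, but keep original values only for stored fields."""
        doc_id = doc["id"]
        self.postings[doc_id] = {
            f: set(str(v).lower().split()) for f, v in doc.items()
        }
        self.stored[doc_id] = {
            f: v for f, v in doc.items() if f in self.stored_fields
        }

    def atomic_set(self, doc_id, field, value):
        """Rebuild the document from stored values, change one field, re-add it."""
        rebuilt = dict(self.stored[doc_id])  # only stored fields can be read back
        rebuilt[field] = value
        del self.postings[doc_id]            # delete the original ...
        del self.stored[doc_id]
        self.add(rebuilt)                    # ... and recreate it

    def search(self, field, term):
        """Return the sorted ids of documents whose field contains the term."""
        return sorted(
            doc_id for doc_id, fields in self.postings.items()
            if term.lower() in fields.get(field, set())
        )


doc = {"id": "101", "title": "draft", "content": "text extracted from the pdf"}

# "content" is indexed but not stored: searchable at first, gone after the update.
unstored = ToyIndex(stored_fields=["id", "title"])
unstored.add(doc)
before = unstored.search("content", "pdf")
unstored.atomic_set("101", "title", "final")
after = unstored.search("content", "pdf")

# With "content" stored as well, the text survives the same update.
kept = ToyIndex(stored_fields=["id", "title", "content"])
kept.add(doc)
kept.atomic_set("101", "title", "final")
after_stored = kept.search("content", "pdf")
```

In this model the record with id "101" still exists after the update
in both cases; only the field that could not be read back is missing.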

The problem is that your PDF text is probably indexed but not stored.
So, when you do the update, it cannot be read back, does not form a
part of the new document and just disappears. Changing that field to
stored (and re-indexing the PDF once, so that a stored copy exists)
should fix the issue, at the cost of keeping the original, untokenized
text in the index. If that has performance implications, you could
look at the enableLazyFieldLoading setting in solrconfig.xml.
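
A sketch of what that change could look like through the Schema API,
written in Python. It only builds the request bodies and sends
nothing. The field names content and title and the field type
text_general are assumptions: the thread does not say which field the
extracted text is mapped to, and the Schema API can only modify fields
when the core uses a managed schema.

```python
import json

CORE_URL = "http://localhost:8983/solr/mycore1"  # core URL from the original post


def replace_field_body(name, field_type):
    """Schema API body that redefines a field as both indexed and stored."""
    return json.dumps({
        "replace-field": {
            "name": name,
            "type": field_type,
            "indexed": True,
            "stored": True,
        }
    })


def atomic_set_body(doc_id, field, value):
    """Update body that sets one field on an existing document."""
    return json.dumps([{"id": doc_id, field: {"set": value}}])


# Assumed names: "content" for the extracted text, "text_general" for its type.
schema_request = (CORE_URL + "/schema",
                  replace_field_body("content", "text_general"))

# Hypothetical "title" field; the id "101" is the one from the original post.
update_request = (CORE_URL + "/update?commit=true",
                  atomic_set_body("101", "title", "final"))
```

Each body would be POSTed to its URL with a Content-Type of
application/json, for example with curl as in the original post.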

Regards,
    Alex.
----
Newsletter and resources for Solr beginners and intermediates:
http://www.solr-start.com/

