Posted to solr-user@lucene.apache.org by Tanguy Moal <ta...@gmail.com> on 2013/09/19 14:47:23 UTC

solr atomic updates stored="true", and copyField limitation

Hello,

I'm using solr 4.4. I have a solr core with a schema defining a bunch of different fields, and among them, a date field:
- date: indexed and stored       // the date used at search time
In practice it's a TrieDateField but I think that's not relevant for the concern.

It also has a multi valued, not required, "string" field named "tags" which contains, well a list of tags, for some of the documents.

So far, so good: everything works as expected and I'm glad.
I'm able to perform partial (or atomic) updates on the tags field whenever it gets modified, and I love it.
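For readers unfamiliar with the feature, a partial update of the tags field might look roughly like the XML update message below. This is only a sketch: the uniqueKey field name "id", the document id "doc-1" and the tag value are hypothetical and not taken from my actual schema.

```xml
<!-- Sketch of an atomic update in Solr 4.x XML update syntax. -->
<!-- "id", "doc-1" and "new-tag" are placeholders for illustration. -->
<add>
  <doc>
    <!-- the uniqueKey value identifies the document to modify -->
    <field name="id">doc-1</field>
    <!-- update="add" appends a value to the multi valued field; -->
    <!-- update="set" would replace the existing values instead -->
    <field name="tags" update="add">new-tag</field>
  </doc>
</add>
```

Only the fields named in the message are modified; Solr rebuilds the rest of the document from its stored fields, which is why the stored="true" requirement matters.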

Now I have a new source that also pushes updates to the same solr core. Unfortunately, that source's incoming documents have their date in another field, of the same type, named created_time instead of date.
- created_time: stored only      // some documents come in with this field set
To be able to sort any document by time, I decided to ask solr to copy the contents of the field created_time to the field named date:
 <copyField source="created_time" dest="date" />

I updated my schema and reloaded my core and everything seemed fine. In fact, I did break something 8-)
But I figured it out later…
Quoting http://wiki.apache.org/solr/Atomic_Updates#Caveats_and_Limitations :
> all fields in your SchemaXml must be configured as stored="true" except for fields which are <copyField/> destinations -- which must be configured as stored="false"


However at that time, I was not aware of the limitation and I was able to sort by time across all the documents in my solr core.
I then decided to make sure that partial (or atomic) updates could still be performed, and then I was surprised:
* documents from the more recent source (having both a date and a created_time field) are updated fine, the date field is kept (the copyField directive is replayed, I guess)
* documents from the first source (having only the date field set) are however a little bit less lucky: the date gets lost in the process (it looks like the date field was overwritten when the copyField directive ran with nothing in its source field)

I then became aware of the caveats and limitations of atomic updates, but now I want to understand why ;-)

So my question is: What differs concerning copyField behaviours between a normal (classic) and a partial (atomic) update?
In practice, I don't understand why the targets of every copyField directive are *always* cleared during partial updates.
Could the destination field be cleared only when one of the copyField's source fields is present in the atomic update? Maybe this was avoided because it would add complexity where it should not be (updates must be fast), but that's just an idea.

I have two ways to handle my problem:
1/ Create a stored="false" search_date field and have two copyField directives, one for the original "date" field and another one for the newer "created_time" field, and make the search application rely on the search_date field
2/ Since I have some control over the second source pushing documents, I can make sure that documents are pushed with the same date field, and work around the limitation by removing the copyField directive entirely.
Since it simplifies my solr schema, I chose option #2.
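
For the record, option #1 could be sketched in the schema roughly as follows. The field type name "tdate" is an assumption (any TrieDateField type defined in the schema would do), and the sketch assumes each document carries only one of the two source fields, since two values copied into a single valued search_date would be rejected.

```xml
<!-- Sketch of option #1; "tdate" is an assumed name for a TrieDateField type. -->
<!-- Source fields are stored so that atomic updates can rebuild the document. -->
<field name="date" type="tdate" indexed="true" stored="true" />
<field name="created_time" type="tdate" indexed="false" stored="true" />

<!-- The copyField destination is indexed but NOT stored, as the wiki requires. -->
<field name="search_date" type="tdate" indexed="true" stored="false" />

<copyField source="date" dest="search_date" />
<copyField source="created_time" dest="search_date" />
```

The search application would then sort and filter on search_date instead of date.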

Thank you very much for your attention

Tanguy

Re: solr atomic updates stored="true", and copyField limitation

Posted by Shawn Heisey <so...@elyograg.org>.
On 9/19/2013 6:47 AM, Tanguy Moal wrote:
> Quoting http://wiki.apache.org/solr/Atomic_Updates#Caveats_and_Limitations :
>> all fields in your SchemaXml must be configured as stored="true" except for fields which are <copyField/> destinations -- which must be configured as stored="false"

For fields created by copyField, the source field(s) should have
stored=true.  The destination field should have stored=false.

Forgetting about atomic updates for a minute, the reason is pretty
simple, especially if you have multiple source fields being dropped in
one destination field:  Storing both of them makes your index bigger
and makes it take longer to retrieve search results, particularly with
version numbers 4.1 or later, because stored values are compressed.

I think you've hit on the exact reason why the caveat exists for
copyFields and atomic updates -- if the source field isn't stored, then
the actual indexed document won't have the source field, which means
that the "doesn't exist" value will be copied over to the destination,
overwriting any actual value that might exist for that field.

It's arguable that it's working as designed, and also working as
documented, both in the wiki and the reference guide, which both say
that all source fields must be stored.

https://cwiki.apache.org/confluence/display/solr/Updating+Parts+of+Documents
http://wiki.apache.org/solr/Atomic_Updates

You could still file a bug (jira issue) if you like, but given that the
documentation is pretty clear, it might not get fixed.

Thanks,
Shawn