Posted to solr-user@lucene.apache.org by Ronald Wood <rw...@smarsh.com> on 2016/08/23 20:01:03 UTC

Is it safe to upgrade an existing field to docvalues?

We are planning to migrate from Solr 4.10.4 to 5.5.2 in the next couple of months. We do not use SolrCloud.

When doing initial testing in our dev and QA environments, we ran into errors for fields that had docvalues newly enabled but had not been re-indexed. Because indexing was ongoing, those fields held a mix of documents with and without docvalues.

Specifically, when we tried to sort or facet we sometimes got errors like:

“IllegalStateException: unexpected docvalues type NONE for field 'id' (expected=SORTED). Use UninvertingReader or index with docvalues.” Id is a string field with docValues=true.

This did not happen consistently, but any occurrence is troublesome.

My reading of tickets like https://issues.apache.org/jira/browse/SOLR-7190 is that when docvalues is not fully available, Solr will fall back to the UninvertingReader. But the error message seems to indicate this is not done automatically for a sort.

In general, is there a way to migrate existing indexes (we have petabytes of data) by enabling docvalues and incrementally re-indexing? We expect the latter would take a month using an atomic update process.

Could this be an artifact of having old 4.x indexes, and it would be wiser to first migrate the indexes to 5.x format before enabling docvalues? (We expect that would also take us a month using incremental optimize.)

Can someone clarify what migration paths to docvalues are likely to succeed?

Thanks!

-Ronald Wood.




Re: Is it safe to upgrade an existing field to docvalues?

Posted by Ronald Wood <rw...@smarsh.com>.
I created https://issues.apache.org/jira/browse/SOLR-9437 for the proposal below.

I suppose besides feasibility, there’s the question of whether the change is needed by others. I’d love to hear if it meets anyone else’s needs.

- Ronald S. Wood 





Re: Is it safe to upgrade an existing field to docvalues?

Posted by Alessandro Benedetti <ab...@apache.org>.
Of course I see your point, Ronald, and don't get me wrong, I don't think it
is a bad idea.
I simply think it could bring some complexity and confusion if we start to use
it as a common approach.
Anyway, let's see what the other Solr gurus think :)

Cheers



-- 
--------------------------

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England

Re: Is it safe to upgrade an existing field to docvalues?

Posted by Pushkar Raste <pu...@gmail.com>.
Hi Ronald,
Turning on docValues for an existing field works in Solr 4. As you mentioned,
it will use the un-inverting method if docValues are not found on an existing
document. This all works fine until segments that have documents without
docValues merge with segments that have docValues for the field. In the
merged segment, documents from the old segment will be stored without
docValues; however, the segment's metadata will indicate docValues are turned ON
for the field in question.

Now if you are sorting on the field those poor documents would seem out of
order and facet counts would be wrong as well.
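
The merge scenario described above can be illustrated with a toy model in Python. This is not Solr or Lucene code: `Doc`, `merge_segments`, and `sort_ids` are hypothetical names, and treating a missing docValue as an empty string is an assumption made only to show how documents can end up out of order once the merged segment is flagged as having docValues.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Doc:
    doc_id: str
    indexed_value: str               # value recoverable by un-inverting the indexed terms
    doc_value: Optional[str] = None  # None: written before docValues was enabled


def merge_segments(*segments):
    """Toy merge: documents are copied as-is, and the merged segment is
    flagged as having docValues if any input document had them."""
    docs = [d for seg in segments for d in seg]
    has_doc_values = any(d.doc_value is not None for d in docs)
    return docs, has_doc_values


def sort_ids(docs, has_doc_values):
    """Sort the way the toy reader would: trust the segment flag."""
    if has_doc_values:
        # Assumption: a missing docValue behaves like an empty value.
        key = lambda d: d.doc_value or ""
    else:
        key = lambda d: d.indexed_value
    return [d.doc_id for d in sorted(docs, key=key)]


old_segment = [Doc("a", "zebra"), Doc("b", "apple")]
new_segment = [Doc("c", "mango", doc_value="mango")]

merged, flag = merge_segments(old_segment, new_segment)
print(sort_ids(merged, flag))                  # order when the flag is trusted
print(sort_ids(merged, has_doc_values=False))  # order from the indexed values
```

In this model the two orderings differ: the old documents sort ahead of everything else when the flag is trusted, even though their indexed values say otherwise.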

Solr 5 doesn't throw an exception if you have a mixed case of docValues for a
field.

I think it is better to create a copy field, reindex all of the data, and
then switch over to using the copy field.


Re: Is it safe to upgrade an existing field to docvalues?

Posted by Ronald Wood <rw...@smarsh.com>.
Alessandro, yes I can see how this could be conceived of as a more general problem; and yes useDocValues also strikes me as being unlike the other properties since it would only be used temporarily.

We’ve actually had to migrate fields from one to another when changing types, along with awkward naming like ‘fieldName’ (int) to ‘fieldNameLong’. But I’m not sure how a change like that could actually be done in place.

The point is stronger when it comes to term vectors etc. where data exists in separate files and switches in code control whether they are used or not.

I guess where I would argue that docValues might be different is that so much new functionality depends on this that it might be worth treating it differently. Given that docValues is now on by default, I wonder if it will at some point be mandatory, in which case everyone would have to migrate to keep up with Solr versions. (Of course, I don’t know what the general thinking is on this amongst the implementers.)

Regardless, this change may be so important to us that we’d choose to branch the code on GitHub and apply the patch ourselves, use it while we transition, and then deploy an official build once we’re done. The difference in the level of effort between this approach and the alternatives would be too great. The risks of using a custom build for production would have to be weighed carefully, naturally.

- Ronald S. Wood 





Re: Is it safe to upgrade an existing field to docvalues?

Posted by Alessandro Benedetti <ab...@apache.org>.
> switching is done in Solr on field.hasDocValues. The code would be amended
> to (field.hasDocValues && field.useDocValues) throughout.
>

This is correct. Currently we use DocValues if they are available, and to
check their availability we check the schema attribute.
This can be problematic in the scenarios you described (for example, half
the index has docValues for a field and the other half does not yet).

Your proposal is interesting.
Technically it should work and should allow transparent migration from
non-docValues to docValues.
But it is a risky one, because we are decreasing the readability a bit
(although a user will specify the attribute only in special cases like
yours).

The only problem I see is that the same discussion we had for docValues
actually applies to all other invasive schema changes:
1) you change the field type
2) you enable or disable term vectors
3) you enable/disable term positions, offsets, etc.

So basically this is actually a general problem that would probably
require a general re-think.
So although it can be a quick fix that will work, I fear it can open the road to
messy configuration attributes.

Cheers

Re: Is it safe to upgrade an existing field to docvalues?

Posted by Ronald Wood <rw...@smarsh.com>.
OK. Thank you, Alessandro, for clarifying this matter.

The reason I wasn’t sure about this is that this is somewhat ambiguous in the documentation. In the 6.1 Guide I see: “If you have already indexed data into your Solr index, you will need to completely re-index your content after changing your field definitions in schema.xml in order to successfully use docValues.”  Maybe that should read “...in order to successfully sort or facet on that field or use any other features that depend on docValues. Partially converted indexes will result in exceptions because of inconsistent data for docValues. ”

Moreover, as I mentioned in my first post, I saw some indication that Solr will fall back to using the UninvertingReader if it doesn’t find docValues as expected.

In my testing, I did see that /export was definitely an all or nothing case: all data had to be docValues before I could get data. /select mostly works – except when it occasionally doesn’t.

-------------------

*I wonder if I can make a proposal*: would it be possible to add a property to the schema called useDocValues=true/false, defaulting to true?

The idea would be that if docValues=true, indexing docValues would be as before, but Solr would not use them as long as useDocValues=false.

Once anyone using this is sure that docValues are fully indexed, set useDocValues=true (or remove), and Solr would behave as now.

I spent a little time going down into the code and at first glance this seems feasible. I would be willing to log the ticket and perhaps provide a patch.

Does this sound feasible to anyone else? I am uncertain if this requires any changes at the Lucene level, but looking at Solr core code all the switching is done in Solr on field.hasDocValues. The code would be amended to (field.hasDocValues && field.useDocValues) throughout.
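
To make the proposal concrete, here is a minimal Python sketch of the decision it describes. It is only a model of the idea, not Solr code: `SchemaField`, `use_doc_values`, and `value_source` are hypothetical names, and the real check would live wherever Solr currently tests field.hasDocValues.

```python
from dataclasses import dataclass


@dataclass
class SchemaField:
    name: str
    has_doc_values: bool         # mirrors docValues=true in the schema
    use_doc_values: bool = True  # the proposed flag; default keeps current behaviour


def value_source(field: SchemaField) -> str:
    """Return which structure a sort or facet would read in this toy model."""
    if field.has_doc_values and field.use_doc_values:
        return "docValues"
    return "uninverted index"


migrating = SchemaField("id", has_doc_values=True, use_doc_values=False)
finished = SchemaField("id", has_doc_values=True)

print(value_source(migrating))  # docValues are written but not yet read
print(value_source(finished))   # behaves as Solr does today
```

The point of the default is that existing schemas, which never mention the new flag, would see no change in behaviour.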

I would have to imagine this would be helpful to others out there with large amounts of data to migrate.

- Ronald S. Wood 





Re: Is it safe to upgrade an existing field to docvalues?

Posted by Alessandro Benedetti <ab...@apache.org>.
I am sorry, Ronald, but:
"I ask because my presupposition has been that we could turn it on without
any harm as we incrementally converted our indexes."

This is not possible. If you change the schema and then slowly update the
documents, you are introducing an inconsistency that will be reflected in
sorting and faceting.
Solr will check the field attributes and see docValues, but then
will find only partial docValues.
So the docValue for some documents will be null.

You need to go live one-shot.
This is the reason Shawn and Toke suggest a parallel index, with the
docValues enabled and finally you swap.
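
The inconsistency described above can be sketched as a toy model in Python. It is not how Solr or Lucene is implemented: the segment dictionaries, `DocValuesMismatch`, and `read_sort_values` are hypothetical, and the only behaviour taken from this thread is that the schema is trusted first and the missing docValues are discovered afterwards.

```python
class DocValuesMismatch(Exception):
    """Raised when the schema promises docValues that a segment lacks."""


def read_sort_values(schema_has_doc_values, segments, field):
    """Collect sortable values, choosing the source from the schema alone."""
    values = []
    for seg in segments:
        if schema_has_doc_values:
            if field not in seg["doc_values"]:
                raise DocValuesMismatch(
                    f"no docValues for field '{field}' in segment {seg['name']}"
                )
            values.extend(seg["doc_values"][field])
        else:
            values.extend(seg["indexed"][field])
    return sorted(values)


# One segment written before the schema change, one written after it.
old_seg = {"name": "_0", "indexed": {"id": ["b", "a"]}, "doc_values": {}}
new_seg = {"name": "_1", "indexed": {"id": ["c"]}, "doc_values": {"id": ["c"]}}

print(read_sort_values(False, [old_seg, new_seg], "id"))  # old schema: works
try:
    read_sort_values(True, [old_seg, new_seg], "id")      # new schema, mixed index
except DocValuesMismatch as err:
    print("failed:", err)
```

A fully rebuilt parallel index avoids the failing branch entirely, because every segment it contains was written under the new schema.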

Cheers


Re: Is it safe to upgrade an existing field to docvalues?

Posted by Ronald Wood <rw...@smarsh.com>.
Yes, Shawn, our indexes meet the requirements for atomic updates. 

We actually depend on atomic updates since our users can alter metadata about any of our indexed records. We don’t have to incur the cost of a full re-index of a record for every change. This is especially critical when a user does a bulk update of the status of 1 million records. ☺
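
For readers unfamiliar with atomic updates, the following Python sketch builds the kind of JSON body such a bulk status change could send. The "set" modifier is part of Solr's atomic update syntax; the field name `status`, the document ids, and the helper `atomic_set_batch` are made up for illustration, and the sketch stops short of actually posting to an update handler.

```python
import json


def atomic_set_batch(doc_ids, field, value):
    """Build a JSON body of atomic 'set' updates, one entry per document id."""
    return json.dumps(
        [{"id": doc_id, field: {"set": value}} for doc_id in doc_ids]
    )


# Hypothetical ids and field name; a real batch would be sent to the
# core's update handler in chunks rather than printed.
body = atomic_set_batch(["rec-1", "rec-2"], "status", "reviewed")
print(body)
```

Because an atomic update rewrites the whole document from its stored fields, the same mechanism is what makes an incremental, in-place re-index conceivable at all.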

- Ronald S. Wood





Re: Is it safe to upgrade an existing field to docvalues?

Posted by Shawn Heisey <ap...@elyograg.org>.
On 8/23/2016 2:01 PM, Ronald Wood wrote:
> In general, is there a way to migrate existing indexes (we have petabytes of data) by enabling docvalues and incrementally re-indexing? We expect the latter would take a month using an atomic update process.

One way to handle it is to build a new index with an updated
configuration, then switch to the new index.  Since you're not running
SolrCloud, you can switch by swapping the cores.  If you were running
SolrCloud, you'd need to alias the old name to the new collection, which
might involve deleting the old collection first.  Swapping cores in
cloud mode will break things.
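
As a rough illustration of the two switch-over routes, this Python sketch only builds the request URLs and sends nothing. SWAP is a CoreAdmin action and CREATEALIAS is a Collections API action; the host, the core name `records`, the rebuilt core `records_dv`, and the helper names are assumptions for the example.

```python
from urllib.parse import urlencode


def core_swap_url(base_url, live_core, rebuilt_core):
    """CoreAdmin SWAP: exchange the names of two cores (non-cloud mode)."""
    params = urlencode({"action": "SWAP", "core": live_core, "other": rebuilt_core})
    return f"{base_url}/admin/cores?{params}"


def create_alias_url(base_url, alias, collection):
    """Collections API CREATEALIAS: point an alias at a collection (SolrCloud)."""
    params = urlencode(
        {"action": "CREATEALIAS", "name": alias, "collections": collection}
    )
    return f"{base_url}/admin/collections?{params}"


base = "http://localhost:8983/solr"  # assumed host and port
print(core_swap_url(base, "records", "records_dv"))
print(create_alias_url(base, "records", "records_dv"))
```

Either request would be issued only after the rebuilt index has been fully populated and verified.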

The other replies you've gotten are interesting.  The approach using
Atomic Updates will only work if your index meets the requirements for
Atomic Updates.

https://wiki.apache.org/solr/Atomic_Updates#Caveats_and_Limitations

You've already said it would take a month using atomic update ... which
might mean you've already thought about whether or not your index meets
the requirements.

Toke's tool looks quite interesting, and would probably do the job a lot
faster than any other method.

Thanks,
Shawn


Re: Is it safe to upgrade an existing field to docvalues?

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
Alessandro Benedetti <ab...@apache.org> wrote:
> So basically using your tool you build a copy of the index ( similar to
> what optimize does) without affecting the main index right ?

Yes. Your step-by-step is spot on.

In the end we re-indexed everything, because there were other issues with the index we wanted to fix, so DVEnabler is very much a limited implementation. I am sure that the conversion code can be made a lot faster. 

I could envision a better tool that just used a source index and a destination schema and did the conversion with no setup-fuss. One could also add other usable conversions besides DV-switching: Multi-value-fields with always 1 value/doc could be changed to true single-value, and vice-versa. String fields with numerics could be converted to true numerics. etc. Most of the data is already in the index, so it is "just" a question of re-packing it.
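
A conversion tool of that kind would first need to know whether a field is eligible. The Python sketch below shows two such checks over values already read out of an index; it is not taken from DVEnabler, and the function names and the list-of-lists input format are assumptions.

```python
def can_be_single_valued(values_per_doc):
    """True if every document holds exactly one value for the field."""
    return all(len(values) == 1 for values in values_per_doc)


def can_be_integer(values_per_doc):
    """True if every string value parses as a base-10 integer."""
    def parses(text):
        try:
            int(text)
            return True
        except ValueError:
            return False

    return all(parses(v) for values in values_per_doc for v in values)


# Each inner list is the set of values one document has for the field.
field_values = [["10"], ["42"], ["7"]]
print(can_be_single_valued(field_values))
print(can_be_integer(field_values))
```

Only when a check like this passes for the whole index would re-packing the field under the new type be safe.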

> This is actually a useful tool when re-indexing could be extremely long.

Thank you.


Important note: I stated that I created that tool, which I apologize for. Thomas Egense and I wrote it jointly.

- Toke Eskildsen

Re: Is it safe to upgrade an existing field to docvalues?

Posted by Ronald Wood <rw...@smarsh.com>.
Thanks, Toke. 

I’m still surveying the code; do you know of a place in the code that might be more problematic?

We’d be mainly concerned about searching, sorting and (simple, low-cardinality) faceting working for us.

Some features like grouping are not currently used by us, so in a pinch a custom build might be a partial patch. We’ll just have to see.

- Ronald S. Wood


On 8/25/16, 06:50, "Toke Eskildsen" <te...@statsbiblioteket.dk> wrote:

    Ronald Wood <rw...@smarsh.com> wrote:
    > Did you find you had to do a full conversion all at once because simply turning on
    > docvalues in the schema caused issues?
    
    Yes.
    
    > I ask because my presupposition has been that we could turn it on without any
    > harm as we incrementally converted our indexes.
    
    If you don't use the field for any queries until all the values have been rebuilt, I guess that would work. The I-am-not-so-sure part is how the merger handles the case of a field having DocValues in one segment and not in another.
    
    But mixing docValued & non-docValued segments with a schema that says DocValues will make DocValue-using queries fail, as you seem to have encountered:
    
    > But this won’t work if enabling docvalues in the schema will lead to errors when
    > fields don’t have docvalues actually populated. I.e., the “IllegalStateException:
    > unexpected docvalues type NONE for field 'id' (expected=SORTED)” error I see.
    
    > I’m still trying to get to the bottom of whether that error means I cannot safely do
    > an incremental conversion in-place.
    
    When you enable docValues in the Solr schema, Solr also uses that information when reading the data from the segments, so when the code detects the missing docValues in the segments themselves, it is already too far down the execution path to change strategy. Basically the contract (schema) is broken, so all bets are off.
    
    Your gradual enabling would work, at least for faceting, if it was possible to force the selection code to only use the indexed values, but the current code does not have an option for forcing the use of the indexed value. You could add it as a (small) hack, if you are comfortable with that. I don't know how easy or hard it would be to hack grouping.
    
    - Toke Eskildsen
    



Re: Is it safe to upgrade an existing field to docvalues?

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
Ronald Wood <rw...@smarsh.com> wrote:
> Did you find you had to do a full conversion all at once because simply turning on
> docvalues in the schema caused issues?

Yes.

> I ask because my presupposition has been that we could turn it on without any
> harm as we incrementally converted our indexes.

If you don't use the field for any queries until all the values have been rebuilt, I guess that would work. The I-am-not-so-sure-part is how the merger handles the case of a field having DocValues in one segment and not in another.

But mixing docValued & non-docValued segments with a schema that says DocValues will make DocValue-using queries fail, as you seem to have encountered:

> But this won’t work if enabling docvalues in the schema will lead to errors when
> fields don’t have docvalues actually populated. I.e., the “IllegalStateException:
> unexpected docvalues type NONE for field 'id' (expected=SORTED)” error I see.

> I’m still trying to get to the bottom of whether that error means I cannot safely do
> an incremental conversion in-place.

When you enable docValues in the Solr schema, Solr also uses that information when reading the data from the segments, so when the code detects the missing docValues in the segments themselves, it is already too far down the execution path to change strategy. Basically the contract (schema) is broken, so all bets are off.
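
The broken contract can be pictured with a toy model. This is not Lucene or Solr code: the segment dictionaries, the `sort_values` function and the use of `RuntimeError` are assumptions made for illustration. It only shows why a query that expects SORTED docValues fails as soon as it reaches a segment written before docValues were enabled.

```python
from enum import Enum


class DocValuesType(Enum):
    NONE = "NONE"
    SORTED = "SORTED"


def sort_values(segments, field, expected):
    """Collect per-segment values, failing when a segment lacks the docValues
    type that the schema promises for the field."""
    values = []
    for segment in segments:
        actual = segment["docvalues_type"].get(field, DocValuesType.NONE)
        if actual is not expected:
            raise RuntimeError(
                f"unexpected docvalues type {actual.value} for field "
                f"'{field}' (expected={expected.value})"
            )
        values.extend(segment["docvalues"][field])
    return values


if __name__ == "__main__":
    # Hypothetical segments: one written after docValues were enabled, one before.
    new_segment = {"docvalues_type": {"id": DocValuesType.SORTED},
                   "docvalues": {"id": ["a", "b"]}}
    old_segment = {"docvalues_type": {}, "docvalues": {}}

    print(sort_values([new_segment], "id", DocValuesType.SORTED))
    try:
        sort_values([new_segment, old_segment], "id", DocValuesType.SORTED)
    except RuntimeError as err:
        print(err)
```

In this model the failure depends on which segments a request touches, so a partly converted index can work for some requests and fail for others.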

Your gradual enabling would work, at least for faceting, if it was possible to force the selection code to only use the indexed values, but the current code does not have an option for forcing the use of the indexed value. You could add it as a (small) hack, if you are comfortable with that. I don't know how easy or hard it would be to hack grouping.

- Toke Eskildsen

Re: Is it safe to upgrade an existing field to docvalues?

Posted by Ronald Wood <rw...@smarsh.com>.
Thanks Toke. I’ve read some of your other helpful blog entries but I missed that one. 

Did you find you had to do a full conversion all at once because simply turning on docvalues in the schema caused issues? 

I ask because my presupposition has been that we could turn it on without any harm as we incrementally converted our indexes.

I have actually written a tool that will round robin between different indexes on the same server and update 10,000 items every two minutes by doing a simple atomic update (updating a date field). This actually works well in my testing. 
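
A rough sketch of that round-robin idea, for readers who want to picture it. This is not the actual tool: the core names, the `touched_dt` field and the function names are hypothetical. It builds only the JSON bodies for atomic updates (the `{"set": ...}` modifier) and leaves out the HTTP call and the two-minute pause between batches.

```python
import json
from itertools import islice


def atomic_touch_batch(doc_ids, field, timestamp):
    """Build an atomic-update payload that sets one field on each document."""
    return [{"id": doc_id, field: {"set": timestamp}} for doc_id in doc_ids]


def round_robin_batches(pending, batch_size):
    """Yield (core, ids) pairs, taking one batch from each core in turn.

    `pending` maps a core name to an iterable of document ids still to convert.
    """
    iterators = {core: iter(ids) for core, ids in pending.items()}
    while iterators:
        for core in list(iterators):
            batch = list(islice(iterators[core], batch_size))
            if batch:
                yield core, batch
            else:
                del iterators[core]  # nothing left to convert in this core


if __name__ == "__main__":
    # Hypothetical cores and ids; a real run would page ids out of each index.
    pending = {"core_a": ["a1", "a2", "a3"], "core_b": ["b1"]}
    for core, ids in round_robin_batches(pending, batch_size=2):
        body = json.dumps(atomic_touch_batch(ids, "touched_dt", "2016-08-24T00:00:00Z"))
        print(core, body)  # a real run would POST this body to the core's update handler
```

An atomic update makes Solr rebuild the whole document from its stored fields, which is why touching a single date field is enough to get the document rewritten under the new schema.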

But this won’t work if enabling docvalues in the schema will lead to errors when fields don’t have docvalues actually populated. I.e., the “IllegalStateException: unexpected docvalues type NONE for field 'id' (expected=SORTED)” error I see.

I’m still trying to get to the bottom of whether that error means I cannot safely do an incremental conversion in-place.

Your approach may come in handy if we cannot.

-Ronald.

On 8/24/16, 07:42, "Alessandro Benedetti" <ab...@apache.org> wrote:

    Hi Toke !
    Good stuff !
    
    So basically using your tool you build a copy of the index ( similar to
    what optimize does) without affecting the main index right ?
    So the procedure would be :
    
    1) Solr is running
    2) Run the tool pointing to the Solr Index, this will be slow but will
    generate a new index copying from the stored content to docValues
    3) Stop Solr
    4) Change Solr schema enabling docValues
    5) Point to the new converted index
    6) Start Solr
    
    Am I right ?
    This is actually a useful tool when re-indexing could be extremely long.
    
    Cheers
    
    On Wed, Aug 24, 2016 at 12:05 PM, Toke Eskildsen <te...@statsbiblioteket.dk>
    wrote:
    
    > On Tue, 2016-08-23 at 20:01 +0000, Ronald Wood wrote:
    > > In general, is there a way to migrate existing indexes (we have
    > > petabytes of data) by enabling docvalues and incrementally re-
    > > indexing? We expect the latter would take a month using an atomic
    > > update process.
    >
    > I did write a tool for that at some point, that works with Solr 4.x.
    > https://github.com/netarchivesuite/dvenabler
    >
    > It is not polished though, and the main problem is that it is slow to
    > convert. Some details at
    > https://sbdevel.wordpress.com/2014/12/15/changing-field-type-in-lucenesolr/
    >
    > - Toke Eskildsen
    >
    >
    
    
    -- 
    --------------------------
    
    Benedetti Alessandro
    Visiting card : http://about.me/alessandro_benedetti
    
    "Tyger, tyger burning bright
    In the forests of the night,
    What immortal hand or eye
    Could frame thy fearful symmetry?"
    
    William Blake - Songs of Experience -1794 England
    



Re: Is it safe to upgrade an existing field to docvalues?

Posted by Alessandro Benedetti <ab...@apache.org>.
Hi Toke !
Good stuff !

So basically using your tool you build a copy of the index ( similar to
what optimize does) without affecting the main index right ?
So the procedure would be :

1) Solr is running
2) Run the tool pointing to the Solr Index, this will be slow but will
generate a new index copying from the stored content to docValues
3) Stop Solr
4) Change Solr schema enabling docValues
5) Point to the new converted index
6) Start Solr

Am I right ?
This is actually a useful tool when re-indexing could be extremely long.

Cheers

On Wed, Aug 24, 2016 at 12:05 PM, Toke Eskildsen <te...@statsbiblioteket.dk>
wrote:

> On Tue, 2016-08-23 at 20:01 +0000, Ronald Wood wrote:
> > In general, is there a way to migrate existing indexes (we have
> > petabytes of data) by enabling docvalues and incrementally re-
> > indexing? We expect the latter would take a month using an atomic
> > update process.
>
> I did write a tool for that at some point, that works with Solr 4.x.
> https://github.com/netarchivesuite/dvenabler
>
> It is not polished though, and the main problem is that it is slow to
> convert. Some details at
> https://sbdevel.wordpress.com/2014/12/15/changing-field-type-in-lucenesolr/
>
> - Toke Eskildsen
>
>


-- 
--------------------------

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England

Re: Is it safe to upgrade an existing field to docvalues?

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
On Tue, 2016-08-23 at 20:01 +0000, Ronald Wood wrote:
> In general, is there a way to migrate existing indexes (we have
> petabytes of data) by enabling docvalues and incrementally re-
> indexing? We expect the latter would take a month using an atomic
> update process.

I did write a tool for that at some point, that works with Solr 4.x.
https://github.com/netarchivesuite/dvenabler

It is not polished though, and the main problem is that it is slow to
convert. Some details at
https://sbdevel.wordpress.com/2014/12/15/changing-field-type-in-lucenesolr/

- Toke Eskildsen