Posted to solr-user@lucene.apache.org by matthew sporleder <ms...@gmail.com> on 2020/09/14 19:12:23 UTC

join query limitations

I have hit a bit of a crossroads with our usage of Solr where I want
to include some slightly dynamic data.

I want to ask Solr to find things like "text query" but only if they
meet some specific criteria.  When I have all of those criteria
indexed, everything works great.  (text contains "apples", in_season=1,
sort by latest)

Now I would like to add a criterion which changes every day:
the popularity of a document, specifically.  This appeared to be *the*
canonical use case for external file fields, but I have 50M documents
(and growing), so a *text* file doesn't fit the bill.

I also looked at using {!join}, but its limitations, as I
understand them, appear to rule it out for my use case: I
can't actually use the data from my traffic-stats core to sort or
filter results (text contains "apples", in_season=1, sort by most
traffic, then by latest).

The last option appears to be updating all of my documents every
single day, possibly using atomic/partial updates, but even those have
a growing list of gotchas: losing stored=false fields is a big one,
plus caveats I don't quite understand related to copyFields and changes
to the _version_ field ("the _version_ field is also a non-indexed,
non-stored single valued docValues field"), etc.

Where else can I look?  The last time we attempted something like this
we ended up rebuilding the index from scratch each day and shuffling
it out, which was really pretty nasty.

Thanks,
Matt

Re: join query limitations

Posted by matthew sporleder <ms...@gmail.com>.
This probably carried forward from a very old version organically.  I
am running 7.7.


Re: join query limitations

Posted by Erick Erickson <er...@gmail.com>.
What version of Solr are you using? 'Cause 8.x has this definition for _version_:

<!-- doc values are enabled by default for primitive types such as long so we don't index the version field  -->
 <field name="_version_" type="plong" indexed="false" stored="false"/>

and I find no text like you're seeing in any schema file in 8.x.

So with a prior version, "try it and see"? See: https://issues.apache.org/jira/browse/SOLR-9449 and linked JIRAs;
the _version_ field can be indexed="false" since 6.3 at least, if it's docValues="true". It's not clear to me that it needed
to be indexed="true" even before that, but no guarantees.

updateLog will be defined in solrconfig.xml, but unless you're on a very old version of Solr it doesn't matter
'cause you don't need to have indexed="true". updateLog is not necessary if you're not running SolrCloud...
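
One way to answer "do I use the updateLog?" is to look for an <updateLog> element under <updateHandler> in solrconfig.xml. The following is only a minimal Python sketch of that check. The SAMPLE string is a made-up fragment in the usual layout, not the poster's actual config, and has_update_log is a hypothetical helper name. A real file that relies on includes or property substitution may need more care.

```python
import xml.etree.ElementTree as ET

# Made-up solrconfig.xml fragment for illustration only.
SAMPLE = """<config>
  <updateHandler class="solr.DirectUpdateHandler2">
    <updateLog>
      <str name="dir">${solr.ulog.dir:}</str>
    </updateLog>
  </updateHandler>
</config>"""


def has_update_log(solrconfig_xml: str) -> bool:
    """Return True if an <updateLog> element sits under <updateHandler>."""
    root = ET.fromstring(solrconfig_xml)
    return root.find("./updateHandler/updateLog") is not None


if __name__ == "__main__":
    print(has_update_log(SAMPLE))
```

To check a real core, read its conf/solrconfig.xml into a string and pass that in instead of SAMPLE.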

I strongly urge you to completely remove all your indexes (perhaps create a new collection) and re-index
from scratch if you change the definition. You might be able to get away with deleting all the docs then
re-indexing, but just re-indexing all the docs without starting fresh can have “interesting” results.

Best,
Erick

Re: join query limitations

Posted by matthew sporleder <ms...@gmail.com>.
Yes but "the _version_ field is also a non-indexed, non-stored single
valued docValues field;"  <- is that a problem?

My schema has this:
  <!-- to use updateLog: _version_field must exist in schema, using
       indexed="true" stored="true" and multiValued="false"
  -->
  <field name="_version_" type="long" indexed="true" stored="true"/>

I don't know if I use the updateLog or not.  How can I find out?

I think that would work for me as I could just make a dynamic field like:
<dynamicField name="*_atomici" type="int" indexed="false"
stored="false" multiValued="false" required="false" docValues="true"
/>

---
Yes, it is just for functions, sorting, and boosting.
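
To make "functions, sorting, and boosting" concrete, here is a rough Python sketch of the request parameters such a query might use. Everything named here is an assumption for illustration: the field popularity_atomici (matching the *_atomici pattern above), the searchable field text, the date field published_date, and the helper popularity_query. It only builds a query string and does not talk to Solr.

```python
from urllib.parse import urlencode, parse_qs


def popularity_query(text, field="popularity_atomici", sort_by_popularity=True):
    """Build URL-encoded params for a text search limited to in_season=1.

    Field names are hypothetical; adjust them to the real schema.
    """
    params = {
        "defType": "edismax",
        "q": text,
        "qf": "text",          # assumed name of the searchable text field
        "fq": "in_season:1",   # indexed criterion from the original query
    }
    if sort_by_popularity:
        # Sort on the docValues-only field, then by an assumed date field.
        params["sort"] = f"{field} desc, published_date desc"
    else:
        # Or keep relevance ranking and boost it with a function of the field.
        params["boost"] = f"log(sum({field},1))"
    return urlencode(params)


if __name__ == "__main__":
    print(popularity_query("apples"))
    print(popularity_query("apples", sort_by_popularity=False))
```

The resulting string would be appended to a /select URL. Whether a hard sort or a multiplicative boost is the better fit depends on how much popularity should outweigh text relevance.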


Re: join query limitations

Posted by Erick Erickson <er...@gmail.com>.
Have you seen “In-place updates”?

See: 
https://lucene.apache.org/solr/guide/8_1/updating-parts-of-documents.html

Then use the field as part of a function query. Since it's non-indexed, you
won't be searching on it. That said, you can do a lot with function queries
to satisfy use cases.
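
As a rough illustration of what a daily in-place update could look like, here is a minimal Python sketch that builds the JSON body for an update request using the "set" modifier. The field name popularity_atomici, the unique key name id, and the helper build_in_place_updates are assumptions for the example, not taken from anyone's schema. The sketch only builds the payload and does not send it.

```python
import json


def build_in_place_updates(popularity_by_id, field="popularity_atomici"):
    """Return a JSON array with one {"set": value} update per document.

    Assumes the unique key field is called "id" and that `field` is a
    single-valued, non-indexed, non-stored numeric docValues field.
    """
    docs = [
        {"id": doc_id, field: {"set": int(count)}}
        for doc_id, count in sorted(popularity_by_id.items())
    ]
    return json.dumps(docs)


if __name__ == "__main__":
    # Hypothetical traffic counts keyed by document id.
    print(build_in_place_updates({"doc-2": 7, "doc-1": 1200}))
```

The body would be POSTed to the collection's /update handler with a JSON content type. For 50M documents, sending it in batches is probably more practical than one giant request.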

Best.
Erick
