Posted to solr-user@lucene.apache.org by "S.L" <si...@gmail.com> on 2014/11/13 06:45:36 UTC

Can we query on _version_field ?

Hi All,

We know that the _version_ field is mandatory in a SolrCloud schema.xml and
is expected to be of type long. It also seems to have a unique value within a
collection.

However, a query of the form
http://server1.mydomain.com:7344/solr/collection1/select/?q=*:*&fq=%28_version_:1484632548944380000%29&wt=json
does not seem to return any records. Can we query on the _version_ field
defined in schema.xml?

Thank you.

Re: Can we query on _version_field ?

Posted by Erick Erickson <er...@gmail.com>.
bq: "..._version_ will change on updates", shouldn't that be OK....

Absolutely not OK. Lucene/Solr relies on the uniqueKey being
identical to recognize two submissions as the same document. So if
you update a doc it _must_ have the same uniqueKey or it gets added
as a completely new document in addition to the old one. Having the
_version_ field change on you when you update docs (and
this is _not_ under your control) seems... fraught.

Net-net is then you have two visible copies of the same document.
Not good.

You must have something you can use as a uniqueKey. You say
"If I do a look up based on URL , I am bound to face issues with
character escaping and all"

How do you propose to correlate the UUID field to the URL for
your lookups anyway?

You say:
"To avoid that I was using a UUID for look up , but in SolrCloud it
generates unique per replica , which is not acceptable"

Why not?

The whole _point_ of UUIDs is that they're, well, unique (or at
least very close) no matter where/when they're created so why is
it a problem to generate them on different replicas (NOT as the
uniqueKey however)?

But you still have to make the UUID <-> URL connection; where is
that being handled?

All in all, it seems like you're making this much more difficult than
it needs to be and would be well-served by
1> learning to escape the URLs
or
2> massaging the URL to something more consumable and living
with what might be very occasional duplication
or
3> generating your own UUID on a single machine during indexing
and injecting that into the record (not with UUIDProcessor..., just
the Java class, assuming your ingestion is Java based).
or
4> trusting the UUID generation code will keep UUIDs that
are automatically generated on different machines unique enough for
practical purposes.
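
For option 1, a minimal sketch of what escaping a URL for a lookup could look like in a Java client. SolrJ ships a helper for this (ClientUtils.escapeQueryChars); the hand-rolled function below is only a comparable stand-in so the example needs nothing beyond the JDK. The class name, the field name doctorURL as a filter target, and the example URL are assumptions for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class UrlLookupQuery {
    // Characters that carry special meaning in the Lucene/Solr query syntax.
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&;/";

    /** Backslash-escapes query-syntax characters and whitespace in a raw field value. */
    public static String escapeQueryChars(String value) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (SPECIAL.indexOf(c) >= 0 || Character.isWhitespace(c)) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    /** Builds an fq value for field:value and percent-encodes it for use in a request URL. */
    public static String buildFilterParam(String field, String rawUrl) {
        String fq = field + ":" + escapeQueryChars(rawUrl);
        try {
            return URLEncoder.encode(fq, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        String docUrl = "http://example.com/doctors/view?id=42&tab=bio"; // hypothetical URL
        System.out.println(escapeQueryChars(docUrl));
        System.out.println("fq=" + buildFilterParam("doctorURL", docUrl));
    }
}
```

Two layers are involved: the query-syntax escaping protects the value from the query parser, and the percent-encoding protects it inside the HTTP request. Mixing the two up is a common source of "escaping issues".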


Best,
Erick

On Thu, Nov 13, 2014 at 12:06 PM, Michael Della Bitta
<mi...@appinions.com> wrote:
> You could also find a natural key that doesn't look like an ID and create a
> name-based (Type 3) UUID out of it, with something like Java's
> nameUUIDFromBytes:
>
> https://docs.oracle.com/javase/7/docs/api/java/util/UUID.html#nameUUIDFromBytes%28byte%5B%5D%29
>
> Implementations of this exist in other languages as well.
>
>
> On 11/13/14 11:35, Shawn Heisey wrote:
>>
>> On 11/12/2014 10:45 PM, S.L wrote:
>>>
>>> We know that _version_field is a mandatory field in solrcloud schema.xml,
>>> it is expected to be of type long , it also seems to have unique value in
>>> a
>>> collection.
>>>
>>> However the query of the form
>>>
>>> http://server1.mydomain.com:7344/solr/collection1/select/?q=*:*&fq=%28_version_:1484632548944380000%29&wt=json
>>> does not seems to return any record , can we query on the _version_field
>>> in
>>> the schema.xml ?
>>
>> I've been watching your journey unfold on the mailing list.  The whole
>> thing seems like an XY problem.
>>
>> If I'm reading everything correctly, you want to have a unique ID value
>> that can serve as the uniqueKey, as well as a way to quickly look up a
>> single document in Solr.
>>
>> Is there one part of the URL that serves as a unique identifier that
>> doesn't contain special characters?  It seems insane that you would not
>> have a unique ID value for every entity in your system that is composed
>> of only "regular" characters.
>>
>> Assuming that such an ID exists (and is likely used as one piece of that
>> doctorURL that you mentioned) ... if you can extract that ID value into
>> its own field (either in your indexing code or a custom update
>> processor), you could use that for both uniqueKey and single-document
>> lookups.  Having that kind of information in your index seems like a
>> generally good idea.
>>
>> Thanks,
>> Shawn
>>
>

Re: Can we query on _version_field ?

Posted by Michael Della Bitta <mi...@appinions.com>.
You could also find a natural key that doesn't look like an ID and 
create a name-based (Type 3) UUID out of it, with something like Java's 
nameUUIDFromBytes:

https://docs.oracle.com/javase/7/docs/api/java/util/UUID.html#nameUUIDFromBytes%28byte%5B%5D%29

Implementations of this exist in other languages as well.
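
A short sketch of that idea, assuming the natural key is the document URL. The class name, method name, and example URL are made up for illustration; only UUID.nameUUIDFromBytes comes from the JDK.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

public class NameBasedId {
    /** Derives a deterministic (type 3, MD5-based) UUID from a natural key such as a URL. */
    public static UUID idFor(String naturalKey) {
        return UUID.nameUUIDFromBytes(naturalKey.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        String docUrl = "http://example.com/doctors/view?id=42"; // hypothetical URL
        UUID first = idFor(docUrl);
        UUID second = idFor(docUrl);
        System.out.println(first);
        System.out.println(first.equals(second)); // same input bytes give the same UUID
        System.out.println(first.version());      // name-based UUIDs report version 3
    }
}
```

Because the value depends only on the input bytes, it comes out the same no matter which machine computes it, and it contains only hex digits and hyphens, so it needs no query escaping.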

On 11/13/14 11:35, Shawn Heisey wrote:
> On 11/12/2014 10:45 PM, S.L wrote:
>> We know that _version_field is a mandatory field in solrcloud schema.xml,
>> it is expected to be of type long , it also seems to have unique value in a
>> collection.
>>
>> However the query of the form
>> http://server1.mydomain.com:7344/solr/collection1/select/?q=*:*&fq=%28_version_:1484632548944380000%29&wt=json
>> does not seems to return any record , can we query on the _version_field in
>> the schema.xml ?
> I've been watching your journey unfold on the mailing list.  The whole
> thing seems like an XY problem.
>
> If I'm reading everything correctly, you want to have a unique ID value
> that can serve as the uniqueKey, as well as a way to quickly look up a
> single document in Solr.
>
> Is there one part of the URL that serves as a unique identifier that
> doesn't contain special characters?  It seems insane that you would not
> have a unique ID value for every entity in your system that is composed
> of only "regular" characters.
>
> Assuming that such an ID exists (and is likely used as one piece of that
> doctorURL that you mentioned) ... if you can extract that ID value into
> its own field (either in your indexing code or a custom update
> processor), you could use that for both uniqueKey and single-document
> lookups.  Having that kind of information in your index seems like a
> generally good idea.
>
> Thanks,
> Shawn
>


Re: Can we query on _version_field ?

Posted by "S.L" <si...@gmail.com>.
Garth and Erick,

I am now able to auto-generate ids with the UUID
updateRequestProcessorChain by declaring the id field as type string.

Thanks for your help folks.

On Thu, Nov 13, 2014 at 1:31 PM, Garth Grimm <
GarthGrimm@averyranchconsulting.com> wrote:

> So it sounds like you’re OK with using the docURL as the unique key for
> routing in SolrCloud, but you don’t want to use it as a lookup mechanism.
>
> If you don’t want to do a hash of it and use that unique value in a second
> unique field at feed time,
> and you can’t seem to find any other field that might be unique,
> and you don’t want to make your own UpdateRequestProcessorChain that would
> generate a unique field from your unique key (such as by doing an MD5 hash),
> you might look at the UpdateRequestProcessorChain named “dedupe” in the
> OOB solrconfig.xml.  It’s primarily designed to help dedupe results, but
> its technique is to concatenate multiple fields together to create a
> signature that will be unique in some way.  So instead of having to find
> one field in your data that’s unique, you could look for a couple of fields
> that, if combined, would create a unique field, and configure the “dedupe”
> Processor to handle that.
>
>
> > On Nov 13, 2014, at 12:02 PM, S.L <si...@gmail.com> wrote:
> >
> > I am not sure if this a case of XY problem.
> >
> > I have no control over the URLs to deduce an id from them , those are
> from
> > www, I made the URL the uniqueKey , that way the document gets replaced
> > when a new document with that URL comes in .
> >
> > To do the detail look up I can either use the same <docURL> as it is , or
> > try and generate a unique id filed for each document.
> >
> > For the later option UUID is not behaving as expected in SolrCloud and
> > _version_ field seems to be serving the need .
> >
> > On Thu, Nov 13, 2014 at 11:35 AM, Shawn Heisey <ap...@elyograg.org>
> wrote:
> >
> >> On 11/12/2014 10:45 PM, S.L wrote:
> >>> We know that _version_field is a mandatory field in solrcloud
> schema.xml,
> >>> it is expected to be of type long , it also seems to have unique value
> >> in a
> >>> collection.
> >>>
> >>> However the query of the form
> >>>
> >>
> http://server1.mydomain.com:7344/solr/collection1/select/?q=*:*&fq=%28_version_:1484632548944380000%29&wt=json
> >>> does not seems to return any record , can we query on the
> _version_field
> >> in
> >>> the schema.xml ?
> >>
> >> I've been watching your journey unfold on the mailing list.  The whole
> >> thing seems like an XY problem.
> >>
> >> If I'm reading everything correctly, you want to have a unique ID value
> >> that can serve as the uniqueKey, as well as a way to quickly look up a
> >> single document in Solr.
> >>
> >> Is there one part of the URL that serves as a unique identifier that
> >> doesn't contain special characters?  It seems insane that you would not
> >> have a unique ID value for every entity in your system that is composed
> >> of only "regular" characters.
> >>
> >> Assuming that such an ID exists (and is likely used as one piece of that
> >> doctorURL that you mentioned) ... if you can extract that ID value into
> >> its own field (either in your indexing code or a custom update
> >> processor), you could use that for both uniqueKey and single-document
> >> lookups.  Having that kind of information in your index seems like a
> >> generally good idea.
> >>
> >> Thanks,
> >> Shawn
> >>
> >>
>
>

Re: Can we query on _version_field ?

Posted by Garth Grimm <Ga...@averyranchconsulting.com>.
So it sounds like you’re OK with using the docURL as the unique key for routing in SolrCloud, but you don’t want to use it as a lookup mechanism.

If you don’t want to hash it and put that unique value in a second unique field at feed time,
and you can’t seem to find any other field that might be unique,
and you don’t want to write your own UpdateRequestProcessorChain that would generate a unique field from your unique key (such as by doing an MD5 hash),
you might look at the UpdateRequestProcessorChain named “dedupe” in the out-of-the-box solrconfig.xml.  It’s primarily designed to help dedupe results, but its technique is to concatenate multiple fields together to create a signature that will be unique in some way.  So instead of having to find one field in your data that’s unique, you could look for a couple of fields that, if combined, would create a unique value, and configure the “dedupe” processor to handle that.
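
To make the signature idea concrete, here is a rough client-side sketch that hashes several field values into one MD5 hex string. It is not the code behind Solr's dedupe chain; the class name, the separator byte, and the example field values are all assumptions chosen for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class FieldSignature {
    /** Combines the given field values into a single 32-character MD5 hex signature. */
    public static String signature(String... fieldValues) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            for (String value : fieldValues) {
                md5.update(value.getBytes(StandardCharsets.UTF_8));
                // Separator byte so that ("ab", "c") and ("a", "bc") hash differently.
                md5.update((byte) 0);
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md5.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical field values for one document.
        System.out.println(signature("Dr. Jane Doe", "Cardiology", "Springfield"));
    }
}
```

The result is only as unique as the combination of fields that feeds it, so the choice of fields matters more than the hash function.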


> On Nov 13, 2014, at 12:02 PM, S.L <si...@gmail.com> wrote:
> 
> I am not sure if this a case of XY problem.
> 
> I have no control over the URLs to deduce an id from them , those are from
> www, I made the URL the uniqueKey , that way the document gets replaced
> when a new document with that URL comes in .
> 
> To do the detail look up I can either use the same <docURL> as it is , or
> try and generate a unique id filed for each document.
> 
> For the later option UUID is not behaving as expected in SolrCloud and
> _version_ field seems to be serving the need .
> 
> On Thu, Nov 13, 2014 at 11:35 AM, Shawn Heisey <ap...@elyograg.org> wrote:
> 
>> On 11/12/2014 10:45 PM, S.L wrote:
>>> We know that _version_field is a mandatory field in solrcloud schema.xml,
>>> it is expected to be of type long , it also seems to have unique value
>> in a
>>> collection.
>>> 
>>> However the query of the form
>>> 
>> http://server1.mydomain.com:7344/solr/collection1/select/?q=*:*&fq=%28_version_:1484632548944380000%29&wt=json
>>> does not seems to return any record , can we query on the _version_field
>> in
>>> the schema.xml ?
>> 
>> I've been watching your journey unfold on the mailing list.  The whole
>> thing seems like an XY problem.
>> 
>> If I'm reading everything correctly, you want to have a unique ID value
>> that can serve as the uniqueKey, as well as a way to quickly look up a
>> single document in Solr.
>> 
>> Is there one part of the URL that serves as a unique identifier that
>> doesn't contain special characters?  It seems insane that you would not
>> have a unique ID value for every entity in your system that is composed
>> of only "regular" characters.
>> 
>> Assuming that such an ID exists (and is likely used as one piece of that
>> doctorURL that you mentioned) ... if you can extract that ID value into
>> its own field (either in your indexing code or a custom update
>> processor), you could use that for both uniqueKey and single-document
>> lookups.  Having that kind of information in your index seems like a
>> generally good idea.
>> 
>> Thanks,
>> Shawn
>> 
>> 


Re: Can we query on _version_field ?

Posted by "S.L" <si...@gmail.com>.
I am not sure this is a case of an XY problem.

I have no control over the URLs, which come from the web, so I cannot
deduce an id from them. I made the URL the uniqueKey so that the document
gets replaced when a new document with that URL comes in.

To do the detail lookup I can either use the same <docURL> as it is, or
try to generate a unique id field for each document.

For the latter option, UUID is not behaving as expected in SolrCloud, and the
_version_ field seems to be serving the need.

On Thu, Nov 13, 2014 at 11:35 AM, Shawn Heisey <ap...@elyograg.org> wrote:

> On 11/12/2014 10:45 PM, S.L wrote:
> > We know that _version_field is a mandatory field in solrcloud schema.xml,
> > it is expected to be of type long , it also seems to have unique value
> in a
> > collection.
> >
> > However the query of the form
> >
> http://server1.mydomain.com:7344/solr/collection1/select/?q=*:*&fq=%28_version_:1484632548944380000%29&wt=json
> > does not seems to return any record , can we query on the _version_field
> in
> > the schema.xml ?
>
> I've been watching your journey unfold on the mailing list.  The whole
> thing seems like an XY problem.
>
> If I'm reading everything correctly, you want to have a unique ID value
> that can serve as the uniqueKey, as well as a way to quickly look up a
> single document in Solr.
>
> Is there one part of the URL that serves as a unique identifier that
> doesn't contain special characters?  It seems insane that you would not
> have a unique ID value for every entity in your system that is composed
> of only "regular" characters.
>
> Assuming that such an ID exists (and is likely used as one piece of that
> doctorURL that you mentioned) ... if you can extract that ID value into
> its own field (either in your indexing code or a custom update
> processor), you could use that for both uniqueKey and single-document
> lookups.  Having that kind of information in your index seems like a
> generally good idea.
>
> Thanks,
> Shawn
>
>

Re: Can we query on _version_field ?

Posted by Shawn Heisey <ap...@elyograg.org>.
On 11/12/2014 10:45 PM, S.L wrote:
> We know that _version_field is a mandatory field in solrcloud schema.xml,
> it is expected to be of type long , it also seems to have unique value in a
> collection.
>
> However the query of the form
> http://server1.mydomain.com:7344/solr/collection1/select/?q=*:*&fq=%28_version_:1484632548944380000%29&wt=json
> does not seems to return any record , can we query on the _version_field in
> the schema.xml ?

I've been watching your journey unfold on the mailing list.  The whole
thing seems like an XY problem.

If I'm reading everything correctly, you want to have a unique ID value
that can serve as the uniqueKey, as well as a way to quickly look up a
single document in Solr.

Is there one part of the URL that serves as a unique identifier that
doesn't contain special characters?  It seems insane that you would not
have a unique ID value for every entity in your system that is composed
of only "regular" characters.

Assuming that such an ID exists (and is likely used as one piece of that
doctorURL that you mentioned) ... if you can extract that ID value into
its own field (either in your indexing code or a custom update
processor), you could use that for both uniqueKey and single-document
lookups.  Having that kind of information in your index seems like a
generally good idea.
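
A rough sketch of the extraction step in indexing code. It assumes, purely for illustration, that the id is the last path segment of the URL and consists of letters, digits, underscores, and hyphens; the class name, the pattern, and the example URLs are hypothetical, and real URLs may not follow any such layout.

```java
import java.net.URI;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UrlIdExtractor {
    // Assumed layout: the last path segment is the id, e.g. /doctors/dr-12345
    private static final Pattern LAST_SEGMENT = Pattern.compile("/([A-Za-z0-9_-]+)/?$");

    /** Returns the last path segment if it looks like a plain id, otherwise empty. */
    public static Optional<String> extractId(String url) {
        String path = URI.create(url).getPath();
        if (path == null) {
            return Optional.empty();
        }
        Matcher m = LAST_SEGMENT.matcher(path);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(extractId("http://example.com/doctors/dr-12345?tab=bio"));
        System.out.println(extractId("http://example.com/"));
    }
}
```

Returning an empty Optional for URLs that do not fit the assumed layout lets the indexing code decide what to do with them instead of silently storing a bad id.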

Thanks,
Shawn


Re: Can we query on _version_field ?

Posted by "S.L" <si...@gmail.com>.
Erick,

1. "_version_ will change on updates", shouldn't that be OK? My
understanding of an update here is that a new document is inserted
with the same unique key (<docUrl> in my case), which effectively replaces
the old document. This will not be an issue in my case because the
initial search results, based on <doctorName>, would show basic doctor data,
and when that tile is clicked the detail data would be displayed based
on a lookup of the _version_ id. So as long as the _version_ does not change
except on an "update", I should be good. Of course there is a possibility
of the document being "updated" between the search results being displayed
and the detailed information being requested, but that is unlikely
in my case, because people usually request details as soon as the initial
search results are displayed.


2. Yes, I have used UUIDUpdateProcessorFactory in the following ways, but
neither of them solves the issue, especially in SolrCloud.

*Case 1:*

*schema.xml*

        <field name="id" type="string" indexed="true" stored="true"
            required="true" multiValued="false" />

This does not generate the unique id at all.

*Case 2:*

        <field name="id" type="uuid" indexed="true" stored="true"
            required="true" multiValued="false" />

In this case a unique id is generated, but it is different on every
replica, so we end up with different ids for the same document in different
replicas.


In both cases above, solrconfig.xml had the following entry.

    <updateRequestProcessorChain name="uuid">
        <processor class="solr.UUIDUpdateProcessorFactory">
            <str name="fieldName">id</str>
        </processor>
        <processor class="solr.RunUpdateProcessorFactory" />
    </updateRequestProcessorChain>



On Thu, Nov 13, 2014 at 11:01 AM, Erick Erickson <er...@gmail.com>
wrote:

> _version_ will change on updates I'm pretty sure, so I doubt
> it's suitable.
>
> I _think_ you can use a UUIDUPdateProcessorFactory here.
> I haven't checked this personally, but the idea here is
> that the UUID cannot be assigned on the shard. But if you're
> checking this out, if the UUID is assigned _before_ the doc
> is sent to the destination shard, it should be fine.
>
> Have you checked that out? I'm at a conference, so I can't
> check it out too thoroughly right now...
>
> Best,
> Erick
>
> On Thu, Nov 13, 2014 at 10:18 AM, S.L <si...@gmail.com> wrote:
> > Here is why I want to do this .
> >
> > 1. My unique key is a http URL, doctorURL.
> > 2. If I do a look up based on URL , I am bound to face issues with
> > character escaping and all.
> > 3. To avoid that I was using a UUID for look up , but in SolrCloud it
> > generates unique per replica , which is not acceptable.
> > 4. Now I see that the mandatory _version_ field has a unique value per
> > document and and not unique per replica , so I am exploring the use of
> > _version_ to do a look up only and not neccesarily use it as a unique
> key,
> > is it do able in that case ?
> >
> > On Thu, Nov 13, 2014 at 8:58 AM, Erick Erickson <erickerickson@gmail.com
> >
> > wrote:
> >
> >> Really, I have to ask why you would want to. This is really purely an
> >> internal
> >> thing. I don't know what practical value there would be to search on
> this?
> >>
> >> Interestingly, I can search _version_:[1000000 TO *], but specific
> searches
> >> seem to fail.
> >>
> >> I wonder if there's something wonky going on with searching on large
> longs
> >> here.
> >>
> >> Feels like an XY problem to me though.
> >>
> >> Best,
> >> Erick
> >>
> >> On Thu, Nov 13, 2014 at 12:45 AM, S.L <si...@gmail.com>
> wrote:
> >> > Hi All,
> >> >
> >> > We know that _version_field is a mandatory field in solrcloud
> schema.xml,
> >> > it is expected to be of type long , it also seems to have unique value
> >> in a
> >> > collection.
> >> >
> >> > However the query of the form
> >> >
> >>
> http://server1.mydomain.com:7344/solr/collection1/select/?q=*:*&fq=%28_version_:1484632548944380000%29&wt=json
> >> > does not seems to return any record , can we query on the
> _version_field
> >> in
> >> > the schema.xml ?
> >> >
> >> > Thank you.
> >>
>

Re: Can we query on _version_field ?

Posted by Erick Erickson <er...@gmail.com>.
_version_ will change on updates I'm pretty sure, so I doubt
it's suitable.

I _think_ you can use a UUIDUpdateProcessorFactory here.
I haven't checked this personally, but the idea is
that the UUID cannot be assigned on the shard. If the
UUID is assigned _before_ the doc
is sent to the destination shard, it should be fine.
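
One way to picture "assigned before the doc is sent" is to set the value in the indexing client, so every replica receives the same one. The sketch below assumes a Java ingestion program and models the document as a plain map; the class name and the field name lookupId are hypothetical, and no SolrJ calls are shown.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.UUID;

public class ClientSideId {
    /** Returns a copy of the document with a random UUID in idField, unless one is already set. */
    public static Map<String, Object> withLookupId(Map<String, Object> doc, String idField) {
        Map<String, Object> copy = new LinkedHashMap<>(doc);
        copy.putIfAbsent(idField, UUID.randomUUID().toString());
        return copy;
    }

    public static void main(String[] args) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("doctorURL", "http://example.com/doctors/view?id=42"); // hypothetical values
        doc.put("doctorName", "Jane Doe");
        System.out.println(withLookupId(doc, "lookupId"));
    }
}
```

Note that a random UUID will differ each time the same page is re-indexed, so a lookup id captured before a re-index would no longer match afterwards; a value derived from the URL itself would not have that problem.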

Have you checked that out? I'm at a conference, so I can't
check it out too thoroughly right now...

Best,
Erick

On Thu, Nov 13, 2014 at 10:18 AM, S.L <si...@gmail.com> wrote:
> Here is why I want to do this .
>
> 1. My unique key is a http URL, doctorURL.
> 2. If I do a look up based on URL , I am bound to face issues with
> character escaping and all.
> 3. To avoid that I was using a UUID for look up , but in SolrCloud it
> generates unique per replica , which is not acceptable.
> 4. Now I see that the mandatory _version_ field has a unique value per
> document and and not unique per replica , so I am exploring the use of
> _version_ to do a look up only and not neccesarily use it as a unique key,
> is it do able in that case ?
>
> On Thu, Nov 13, 2014 at 8:58 AM, Erick Erickson <er...@gmail.com>
> wrote:
>
>> Really, I have to ask why you would want to. This is really purely an
>> internal
>> thing. I don't know what practical value there would be to search on this?
>>
>> Interestingly, I can search _version_:[1000000 TO *], but specific searches
>> seem to fail.
>>
>> I wonder if there's something wonky going on with searching on large longs
>> here.
>>
>> Feels like an XY problem to me though.
>>
>> Best,
>> Erick
>>
>> On Thu, Nov 13, 2014 at 12:45 AM, S.L <si...@gmail.com> wrote:
>> > Hi All,
>> >
>> > We know that _version_field is a mandatory field in solrcloud schema.xml,
>> > it is expected to be of type long , it also seems to have unique value
>> in a
>> > collection.
>> >
>> > However the query of the form
>> >
>> http://server1.mydomain.com:7344/solr/collection1/select/?q=*:*&fq=%28_version_:1484632548944380000%29&wt=json
>> > does not seems to return any record , can we query on the _version_field
>> in
>> > the schema.xml ?
>> >
>> > Thank you.
>>

Re: Can we query on _version_field ?

Posted by "S.L" <si...@gmail.com>.
Here is why I want to do this.

1. My unique key is an HTTP URL, doctorURL.
2. If I do a lookup based on the URL, I am bound to face issues with
character escaping and the like.
3. To avoid that I was using a UUID for lookup, but in SolrCloud a different
UUID is generated on each replica, which is not acceptable.
4. Now I see that the mandatory _version_ field has a unique value per
document rather than per replica, so I am exploring the use of
_version_ for lookup only, not necessarily as a unique key.
Is it doable in that case?

On Thu, Nov 13, 2014 at 8:58 AM, Erick Erickson <er...@gmail.com>
wrote:

> Really, I have to ask why you would want to. This is really purely an
> internal
> thing. I don't know what practical value there would be to search on this?
>
> Interestingly, I can search _version_:[1000000 TO *], but specific searches
> seem to fail.
>
> I wonder if there's something wonky going on with searching on large longs
> here.
>
> Feels like an XY problem to me though.
>
> Best,
> Erick
>
> On Thu, Nov 13, 2014 at 12:45 AM, S.L <si...@gmail.com> wrote:
> > Hi All,
> >
> > We know that _version_field is a mandatory field in solrcloud schema.xml,
> > it is expected to be of type long , it also seems to have unique value
> in a
> > collection.
> >
> > However the query of the form
> >
> http://server1.mydomain.com:7344/solr/collection1/select/?q=*:*&fq=%28_version_:1484632548944380000%29&wt=json
> > does not seems to return any record , can we query on the _version_field
> in
> > the schema.xml ?
> >
> > Thank you.
>

Re: Can we query on _version_field ?

Posted by Erick Erickson <er...@gmail.com>.
Really, I have to ask why you would want to. This is really purely an internal
thing. I don't know what practical value there would be to search on this?

Interestingly, I can search _version_:[1000000 TO *], but specific searches
seem to fail.

I wonder if there's something wonky going on with searching on large longs here.
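
One possible contributor, not confirmed anywhere in this thread: longs above 2^53 cannot all be represented exactly as a double, so any tool that parses the wt=json response into floating-point numbers (a browser console, for instance) may display a rounded _version_. A query for the rounded value would then match nothing. The sketch below only demonstrates the rounding itself, using the value from the original query; whether that value was in fact copied from a rounded display is an assumption.

```java
public class LongPrecision {
    /** Round-trips a long through a double, as a consumer that stores numbers as doubles would. */
    public static long throughDouble(long value) {
        return (long) (double) value;
    }

    public static void main(String[] args) {
        long shown = 1484632548944380000L; // the value used in the original fq
        // The nearest double is a multiple of 256 at this magnitude.
        System.out.println(throughDouble(shown));
        System.out.println(throughDouble(shown) == shown);
        // Neighbouring longs collapse onto the same double.
        System.out.println(throughDouble(shown + 1) == throughDouble(shown));
    }
}
```

If this is what happened, the exact value could be checked by reading the field with a client that keeps it as a 64-bit integer, or by requesting a response format that the viewer does not re-parse as floating point.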

Feels like an XY problem to me though.

Best,
Erick

On Thu, Nov 13, 2014 at 12:45 AM, S.L <si...@gmail.com> wrote:
> Hi All,
>
> We know that _version_field is a mandatory field in solrcloud schema.xml,
> it is expected to be of type long , it also seems to have unique value in a
> collection.
>
> However the query of the form
> http://server1.mydomain.com:7344/solr/collection1/select/?q=*:*&fq=%28_version_:1484632548944380000%29&wt=json
> does not seems to return any record , can we query on the _version_field in
> the schema.xml ?
>
> Thank you.