Posted to dev@lucene.apache.org by Patrick Recchia <pa...@gmail.com> on 2018/02/26 22:01:39 UTC

what should go in a docValues field?

Hello all,

I hope I'm in the right place to ask this question.
It seemed to me that this didn't quite qualify for the users' mailing list.
But, if not, feel free to redirect me elsewhere.

The question is: what should we store in a docValues field?

I'm working on the implementation of an InetAddress field type for Solr,
building on the InetAddressPoint field from the Lucene sandbox.
It started as a POC (we use Solr internally and need to index IP address
fields).

It might even be a step in the direction of SOLR-6741, for which I'm
planning to offer a patch.

But I'm struggling with the following problem:
what should I store within the docValues field?

The problem is that docValues are used for two different purposes (from
what I can tell):
- to sort
- and as an optimization when we retrieve only some of the fields of a set
of documents

Each scenario calls for different content.

I see 2 possible options, and am not happy with either of them:

1) string representation:
I could store the string representation of the IP address, e.g. 192.168.1.1
would be stored as "192.168.1.1", as SORTED docValues.
No issue in displaying the field: it is a string, and it appears as a string.
But the sort order would be lexicographic rather than numeric, so unless I
normalize the string somehow, I suppose we would have
192.168.100.1 < 192.168.11.1 (because '0' < '1').
So, wrong choice.
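
The ordering problem with option 1 can be reproduced with plain Java strings. This is only a minimal sketch: the class and method names are made up for illustration, and it assumes that a SORTED docValues field orders its terms byte by byte, which for ASCII-only strings gives the same result as String.compareTo.

```java
import java.util.Arrays;

public class StringOrderDemo {
    // Hypothetical helper: sorts dotted-quad strings lexicographically,
    // which is the order a plain string representation would produce.
    static String[] sortAsStrings(String... addresses) {
        String[] copy = addresses.clone();
        Arrays.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        String[] sorted = sortAsStrings("192.168.11.1", "192.168.100.1", "192.168.2.1");
        // Lexicographic order: 192.168.100.1, 192.168.11.1, 192.168.2.1
        // Numeric order would be: 192.168.2.1, 192.168.11.1, 192.168.100.1
        System.out.println(Arrays.toString(sorted));
    }
}
```

Zero-padding each octet to three digits would be one possible normalization, but it only covers IPv4 and changes what is displayed.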

2) binary representation:
I could store the binary representation of the address, which would make
sense because then sorting would really be based on the numeric value of
the address.
So, 192.168.100.1 > 192.168.11.1.
But then I face the issue of how to render the docValues field (when using
fl, for example).
SolrDocumentFetcher.decorateDocValues goes through a long switch to decide
how to render the docValues.
My field (InetAddressType) won't fit into any of these cases.
So I would also need to patch SolrDocumentFetcher to add the rendering for
my new type.
That feels a bit odd to me: I should be able to add a fieldType without
having to change anything in the rest of Solr.
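
Both halves of option 2 (numeric ordering, and the need to decode before display) can be sketched with the JDK alone. The class and method names below are hypothetical. The 16-byte layout, with IPv4 written as IPv4-mapped IPv6, is an assumption chosen to resemble what InetAddressPoint is understood to do; it is not taken from the Lucene source.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

public class BinaryOrderDemo {
    // Encode an IP literal as 16 bytes. IPv4 becomes IPv4-mapped IPv6:
    // ten zero bytes, 0xFF 0xFF, then the four IPv4 bytes.
    static byte[] encode(String literal) {
        try {
            byte[] raw = InetAddress.getByName(literal).getAddress();
            if (raw.length == 16) {
                return raw;
            }
            byte[] mapped = new byte[16];
            mapped[10] = (byte) 0xFF;
            mapped[11] = (byte) 0xFF;
            System.arraycopy(raw, 0, mapped, 12, 4);
            return mapped;
        } catch (UnknownHostException e) {
            throw new IllegalArgumentException("Not an IP literal: " + literal, e);
        }
    }

    // Decode the stored bytes back to a readable string for display.
    static String decode(byte[] encoded) {
        try {
            return InetAddress.getByAddress(encoded).getHostAddress();
        } catch (UnknownHostException e) {
            throw new IllegalArgumentException("Bad address length", e);
        }
    }

    // Compare byte by byte, treating each byte as unsigned (0..255).
    static int compareUnsigned(byte[] a, byte[] b) {
        for (int i = 0; i < Math.min(a.length, b.length); i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    static int compare(String a, String b) {
        return compareUnsigned(encode(a), encode(b));
    }

    public static void main(String[] args) {
        System.out.println(compare("192.168.100.1", "192.168.11.1") > 0);
        System.out.println(decode(encode("192.168.100.1")));
    }
}
```

The sort order comes out right, but the stored bytes are unreadable until something calls a decode step, which is exactly the rendering gap described above.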

So, I see no easy solution with either 1 or 2.
Or is there something I have overlooked, and the solution is actually
pretty simple, but hidden from my (layman's) eyes?

Incidentally, I somehow feel that the best approach would be to have
something like toObject, toExternal, indexedToReadable, etc. for the
docValues field, directly in FieldType.
That would also clean up the long switch in SolrDocumentFetcher.
But again, that is just my feeling. Is there actually an excellent reason
for having this long switch there, rather than a method on FieldType?
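
The idea of moving the rendering into the field type could look roughly like the sketch below. Everything here is hypothetical: SketchFieldType, SketchInetAddressType and docValueToExternal are stand-ins invented for illustration, not Solr's real FieldType, SolrDocumentFetcher, or any existing method.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.nio.charset.StandardCharsets;

public class DocValuesRenderSketch {
    // Hypothetical stand-in for a field type (NOT Solr's FieldType).
    static class SketchFieldType {
        // Hypothetical hook: turn a raw docValues payload into a displayable
        // object. The default treats the bytes as UTF-8 text.
        Object docValueToExternal(byte[] raw) {
            return new String(raw, StandardCharsets.UTF_8);
        }
    }

    // Hypothetical IP address type that overrides the hook.
    static class SketchInetAddressType extends SketchFieldType {
        @Override
        Object docValueToExternal(byte[] raw) {
            try {
                // Accepts 4-byte (IPv4) or 16-byte (IPv6) payloads.
                return InetAddress.getByAddress(raw).getHostAddress();
            } catch (UnknownHostException e) {
                throw new IllegalArgumentException("Bad address length", e);
            }
        }
    }

    // Stand-in for the fetcher: no per-type switch, it only delegates.
    static Object render(SketchFieldType type, byte[] raw) {
        return type.docValueToExternal(raw);
    }

    public static void main(String[] args) {
        byte[] ip = {(byte) 192, (byte) 168, 1, 1};
        System.out.println(render(new SketchInetAddressType(), ip));
        System.out.println(render(new SketchFieldType(), "abc".getBytes(StandardCharsets.UTF_8)));
    }
}
```

With such a hook, a new field type would carry its own rendering and the fetcher would not need a new case for it.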


Thanks to anyone for any clarification.




-- 
One way of describing a computer is as an electric box which hums.
Never ascribe to malice what can be explained by stupidity
--
Patrick Recchia

Re: what should go in a docValues field?

Posted by Adrien Grand <jp...@gmail.com>.

I'm not familiar with how Solr deals with this in detail, but to me option
2 is the way to go. I also agree with your feeling that there should be a
way to convert the internal representation of doc values to something that
is human-readable.
