Posted to dev@lucene.apache.org by David Smiley <da...@gmail.com> on 2016/03/24 15:34:50 UTC

Lucene FieldType & specifying numeric type (double, float, )

With the move to PointValues and away from trie based indexing of the terms
index, for numerics, everything associated with the trie stuff seems to be
labelled as "Legacy" and marked deprecated.  Even FieldType.NumericType
(now FieldType.LegacyNumericType) -- a simple enum of INT, LONG, FLOAT,
DOUBLE.  I wonder if we ought to reconsider doing this for
FieldType.NumericType, as it articulates the type of numeric data; it need
not be associated with just trie indexing of terms data; it could
articulate how any numeric data is encoded, be it docValues or
pointValues.  This is useful metadata.  It's not strictly required, true,
but it's useful in describing what goes in the field.  This makes a
FieldType instance fairly self-sufficient.  Otherwise, say you have
docValue numerics and/or pointValues, it's ambiguous how the data should be
interpreted.  This doesn't lead to a bug, but the metadata would help
debugging and would allow APIs to express field requirements simply by
providing a FieldType instance for numeric data.  It used to be
self-sufficient, but now, if we imagine the legacy stuff being removed, it's
ambiguous.  In addition, it would be useful metadata if it found its way
into FieldInfo.  Then Luke, say, could help you know what's there and maybe
search it.

Thoughts?

~ David
-- 
Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
http://www.solrenterprisesearchserver.com

Re: Lucene FieldType & specifying numeric type (double, float, )

Posted by Robert Muir <rc...@gmail.com>.
This is not recorded into fieldinfo.

For points, Lucene treats the data as a multidimensional byte[]. It's up
to higher-level indexing classes (Field) and search classes (Query) to
deal with the various ways to encode information in that byte[].

There is a lot more that can go in here than just INT, LONG, FLOAT,
DOUBLE. See the sandbox which indexes BigInteger, InetAddress, etc and
users should be able to extend it to other data types.

The FieldType.LegacyNumericType is brain damage and is rightfully removed.
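
To make the "it is all bytes" point concrete, here is a minimal sketch of
how a higher-level class could turn an int into a byte[] whose unsigned
byte order matches numeric order. The class and method names are
hypothetical and this is not Lucene's own code; it only assumes the common
trick of flipping the sign bit and writing the value big-endian.

```java
/** Sketch only: an order-preserving int -> byte[4] encoding (sign-bit flip, big-endian). */
final class SortableIntBytes {

  static byte[] encode(int value) {
    int flipped = value ^ 0x80000000; // flip the sign bit so negatives sort before positives
    return new byte[] {
      (byte) (flipped >>> 24), (byte) (flipped >>> 16), (byte) (flipped >>> 8), (byte) flipped
    };
  }

  static int decode(byte[] bytes) {
    int flipped = ((bytes[0] & 0xFF) << 24) | ((bytes[1] & 0xFF) << 16)
        | ((bytes[2] & 0xFF) << 8) | (bytes[3] & 0xFF);
    return flipped ^ 0x80000000; // undo the sign-bit flip
  }

  /** Unsigned lexicographic comparison: the only ordering a byte[]-based index needs. */
  static int compareUnsigned(byte[] a, byte[] b) {
    for (int i = 0; i < a.length; i++) {
      int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
      if (diff != 0) {
        return diff;
      }
    }
    return 0;
  }
}
```

With an encoding like this the index never needs to know the value was an
int; only the Field and Query classes that encode and decode it do.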


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: Lucene FieldType & specifying numeric type (double, float, )

Posted by Robert Muir <rc...@gmail.com>.
Again, we don't care, except that it's bytes. FieldInfo already records
the same thing we recorded for legacy shit:

the length in bytes.

That is all legacy numerics ever knew before (simply from the length
of the term), if it was 32 or 64 bits. It could not differentiate
integer from float, you could never do that.

nothing was removed, there is no feature to keep here.
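
A small illustration of why a recorded byte length cannot tell an integer
from a float: the same four bytes decode to either. The class name is made
up for this example, and the bytes are raw big-endian two's complement /
IEEE 754, not whatever encoding Lucene applies before indexing.

```java
import java.nio.ByteBuffer;

/** Sketch only: four stored bytes carry a width, not a numeric type. */
final class SameBytesTwoTypes {

  static int asInt(byte[] fourBytes) {
    return ByteBuffer.wrap(fourBytes).getInt(); // ByteBuffer is big-endian by default
  }

  static float asFloat(byte[] fourBytes) {
    return ByteBuffer.wrap(fourBytes).getFloat();
  }

  public static void main(String[] args) {
    byte[] stored = {0x3F, (byte) 0x80, 0x00, 0x00};
    // Only stored.length (4) is knowable from the bytes; the type is the reader's choice.
    System.out.println(stored.length + " bytes: int=" + asInt(stored) + " float=" + asFloat(stored));
  }
}
```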



Re: Lucene FieldType & specifying numeric type (double, float, )

Posted by David Smiley <da...@gmail.com>.
I should add: if we keep FieldType.NumericType and use it as I suggest, it
would either need a new enum value of "UNSPECIFIED" (think IPv6 or other
custom uses) or null; I'd prefer to avoid the null.
~ David


Re: Lucene FieldType & specifying numeric type (double, float, )

Posted by Yonik Seeley <ys...@gmail.com>.
On Thu, Mar 24, 2016 at 11:28 AM, Jack Krupansky
<ja...@gmail.com> wrote:
> Yeah, I do recall seeing LUCENE-6917 (Deprecate and rename
> NumericField/RangeQuery to LegacyNumeric) go by in the Jira traffic

It was also mere weeks between this deprecation (which did not address
Solr), and the proposal to start the lucene/solr 6 release process,
virtually ensuring that Solr would be on deprecated numeric types for
6.0

Of course given that the release process has apparently stalled and
development of the Point stuff is continuing, it seems like the
deprecation was premature.

This would also seem to mark the end of the ability to upgrade indexes
without reindexing (unless the IndexUpgrader will acquire the ability
to migrate from old numerics to new numerics).

-Yonik



Re: Lucene FieldType & specifying numeric type (double, float, )

Posted by Ryan Ernst <ry...@iernst.net>.
Scalar doesn't mean anything. Point is simple: it is a point in
n-dimensional space, which is what the data structure provides fast
searching on. Numbers are points in one-dimensional space. Think of a
number line.
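
The number-line picture can be sketched as follows, treating a value as an
int[] with one entry per dimension, so that a plain number is just the
one-dimensional case. The names are hypothetical and this is not Lucene's
API.

```java
/** Sketch only: a "point" is an int[] with one entry per dimension; a number is the 1D case. */
final class PointBoxSketch {

  /** True if point lies inside the axis-aligned box [lower, upper], inclusive, in every dimension. */
  static boolean contains(int[] lower, int[] upper, int[] point) {
    if (lower.length != upper.length || lower.length != point.length) {
      throw new IllegalArgumentException("dimension mismatch");
    }
    for (int dim = 0; dim < point.length; dim++) {
      if (point[dim] < lower[dim] || point[dim] > upper[dim]) {
        return false;
      }
    }
    return true;
  }
}
```

A one-dimensional range check such as contains({10}, {20}, {15}) and a
two-dimensional bounding-box check go through exactly the same code path.
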
On Mar 24, 2016 8:37 AM, "David Smiley" <da...@gmail.com> wrote:

> bq. it wasn't at all clear that the intention was that simple scalars
> would now and forever henceforth be referred to as "points". My impression
> at the time was that the focus of the Jira was on implementation and
> storage level indexing detail rather than the user-facing API level. I see
> now that I was wrong about that. It just seems to me that there should have
> been a more direct public discussion of eliminating the concept of scalar
> values at the API level.
>
> I knew because I was following closely, but otherwise I agree with your
> sentiment.  I don't love the "PointValues" terminology either nor did I
> like "DimensionalValues"; I should have suggested alternatives at the time
> but the Mike & Rob tag-team were working so fast that I didn't interject in
> the narrow window of time before a patch was put up with the current
> names.  More time to publicly discuss would have been better.  FWIW I like
> your suggestion for "Scalar"; that's more meaningful to me.  Naming is hard.
>
> ~ David
>
> On Thu, Mar 24, 2016 at 11:28 AM Jack Krupansky <ja...@gmail.com>
> wrote:
>
>> I wasn't paying close attention when this whole PointValues saga was
>> unfolding. I get the value of points for spatial data, but conflating the
>> terms "point" and "numeric" is bizarre to say the least. Reading the code,
>> I see "Points represent numeric values", which seems nonsensical to me. A
>> little later the code comment says "Geospatial Point Types - Although basic
>> point types such as DoublePoint support points in multi-dimensional space
>> too, Lucene has specialized classes for location data...", which continues
>> this odd use of terminology. I mean, aren't all points spatial by
>> definition, so that "Geospatial Point" is redundant? It would make more
>> sense to speak of a point as a geospatial number, or that a point is
>> represented by numbers.
>>
>> IOW, NumericValues would make more sense as the base, with (spatial)
>> PointValues derived from the base of numeric values. At least to me that
>> would make more sense.
>>
>> As the PointValues was progressing I had no idea that its intent was to
>> subsume, replace, or deprecate traditional scalar numeric value support in
>> Lucene (or Solr.) It came across primarily as being an improvement for
>> spatial search.
>>
>> Not that I have any objection to greatly improved storage in Lucene, but
>> to now have to speak of all numeric data as points seems quite... weird.
>>
>> Sure, I saw the Jira traffic, like LUCENE-6825 (Add multidimensional
>> byte[] indexing support to Lucene) and LUCENE-6852 (Add DimensionalFormat
>> to Codec), but in all honesty that really did come across as relating to
>> purely spatial data and not being applicable to basic scalar number support.
>>
>> Looking at CHANGES.TXT, I see references like "LUCENE-6852, LUCENE-6975:
>> Add support for points (dimensionally indexed values)", but without any
>> hint that the intent was to subsume or replace non-dimensional numeric
>> indexed values.
>>
>> Now for all I know, non-dimensional (scalar) numeric data can very
>> efficiently be handled as if it had dimension, but that's not exactly
>> obvious and warrants at least some illumination. In traditional terminology
>> a point is 0-dimension (a line is 1-dimension, and a plane is 2-dimension),
>> but traditionally a raw number - a scalar - hasn't been referred to as
>> having dimension, so that is a new concept warranting clear definition.
>>
>> Yeah, I do recall seeing LUCENE-6917 (Deprecate and rename
>> NumericField/RangeQuery to LegacyNumeric) go by in the Jira traffic, and
>> shame on me for not reading the details more carefully, but it wasn't at
>> all clear that the intention was that simple scalars would now and forever
>> henceforth be referred to as "points". My impression at the time was that
>> the focus of the Jira was on implementation and storage level indexing
>> detail rather than the user-facing API level. I see now that I was wrong
>> about that. It just seems to me that there should have been a more direct
>> public discussion of eliminating the concept of scalar values at the API
>> level.
>>
>> (I wonder what physics would be like if they started referring to scalar
>> quantities as vectors.)
>>
>> My apologies for the rant.
>>
>>
>> -- Jack Krupansky
>>

Re: Lucene FieldType & specifying numeric type (double, float, )

Posted by Joel Bernstein <jo...@gmail.com>.
Thanks Robert, sounds good. And I'll give the blog post a read Mike.

Joel Bernstein
http://joelsolr.blogspot.com/

On Thu, Mar 24, 2016 at 12:51 PM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> See also my recent blog post describing this new feature:
> https://www.elastic.co/blog/lucene-points-6.0
>
> Net/net, in the 1D case, points look like a win across the board vs.
> the legacy (postings) implementation.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Thu, Mar 24, 2016 at 12:33 PM, Robert Muir <rc...@gmail.com> wrote:
> > On Thu, Mar 24, 2016 at 12:16 PM, Joel Bernstein <jo...@gmail.com> wrote:
> >> I'm pretty confused about points as well and until very recently thought
> >> these were geospatial improvements only.
> >>
> >> It would be good to understand the mechanics of points versus numerics.
> >> I'm particularly interested in not losing the high performance numeric
> >> DocValues support, which has become so important for analytics.
> >>
> >
> > Unrelated. Points are the structure used to find matching documents
> > from e.g. a query point, range, radius, shape, whatever. They use a
> > tree-like structure for this. So they are the replacement for
> > NumericRangeQuery, which "simulates" a tree with an inverted index.
> >
> > Instead of inverted index+postings list, we just have a proper tree
> > structure for these things: fixed-width, multidimensional values. It
> > has a different indexreader api, for example, that lets you control how
> > the tree is traversed as it goes, by returning INSIDE [collect all the
> > docids in here blindly, this entire tree range is relevant], OUTSIDE
> > [not relevant to my query, don't traverse this region anymore], or
> > CROSSES [i may or may not be interested, have to traverse further to
> > nodes (sub-ranges or values themselves)].
> >
> > They also have the advantage of not being limited to 64 bits or 1
> > dimension, you can have up to 128 bits and up to 8 dimensions. So each
> > thing you are adding to your document is really a "point in
> > n-dimensional space", so if you want to have 3 lat+long pairs as a
> > double[] in a single field, that works as you expect.
> >
> > See more information here:
> > https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/index/PointValues.java#L35-L79
>
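
The traversal Robert describes can be sketched in one dimension as below.
This is a toy over a sorted int[] with a hypothetical Relation enum named
after the three outcomes in his message; Lucene's real reader API works on
byte[] cells and may use different names.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy 1D illustration of INSIDE / OUTSIDE / CROSSES traversal; not Lucene code. */
final class RangeTreeSketch {

  enum Relation { INSIDE, OUTSIDE, CROSSES }

  /** Relation of the cell [cellMin, cellMax] to the query range [lo, hi], all inclusive. */
  static Relation compare(int cellMin, int cellMax, int lo, int hi) {
    if (cellMax < lo || cellMin > hi) {
      return Relation.OUTSIDE;
    }
    if (cellMin >= lo && cellMax <= hi) {
      return Relation.INSIDE;
    }
    return Relation.CROSSES;
  }

  /** Collects values of sorted[from, to) that fall in [lo, hi] by recursive halving. */
  static void intersect(int[] sorted, int from, int to, int lo, int hi, List<Integer> out) {
    if (from >= to) {
      return;
    }
    switch (compare(sorted[from], sorted[to - 1], lo, hi)) {
      case OUTSIDE:
        return; // prune: nothing under this cell can match
      case INSIDE:
        for (int i = from; i < to; i++) {
          out.add(sorted[i]); // collect blindly, no per-value check needed
        }
        return;
      default: // CROSSES: only possible with at least two values, so halving terminates
        int mid = (from + to) >>> 1;
        intersect(sorted, from, mid, lo, hi, out);
        intersect(sorted, mid, to, lo, hi, out);
    }
  }

  static List<Integer> query(int[] sorted, int lo, int hi) {
    List<Integer> out = new ArrayList<>();
    intersect(sorted, 0, sorted.length, lo, hi, out);
    return out;
  }
}
```

The point of the three-way answer is that whole subtrees are either
skipped or collected without looking at individual values; only cells that
straddle the query boundary are descended into.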

Re: Lucene FieldType & specifying numeric type (double, float, )

Posted by Jack Krupansky <ja...@gmail.com>.
Thanks, Mike. I see the prompt commits!

1. I wasn't suggesting to revert to the old implementation of IntField,
just reusing the name - simply renaming IntPoint to IntField.

2. Since my previous message I see that you (and others) have been also
using the term "dimensionalValues" (not points) in some Jiras related to
this work, so the terminology use does need to get cleaned up:
https://issues.apache.org/jira/browse/LUCENE-6917
https://issues.apache.org/jira/browse/SOLR-8396

3. I've also added some comments on a related Solr Jira that intersect with
the Lucene points stuff that you might want to chime in on:
https://issues.apache.org/jira/browse/SOLR-8396

4. The main question on all of this - my points - is whether any of the
senior committers (especially Solr) wish to elevate the importance of any
of these points from my modest level of rambling.

With that, I think I'm done on this topic... for now.

-- Jack Krupansky


Re: Lucene FieldType & specifying numeric type (double, float, )

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Fri, Mar 25, 2016 at 6:23 PM, Jack Krupansky
<ja...@gmail.com> wrote:

> Mike, thanks for that blog post link.

You're welcome!

> (Please let me know if this discussion should be moved elsewhere, either to
> Jira or a fresh thread, although it seems germane to David's original
> inquiry, at least a little.)

Here seems good.

> 1. You need to update the post a little, like the change for ExactPointQuery
> that occurred on 2/20, a few days after your post:
> https://issues.apache.org/jira/browse/LUCENE-7039
>
> In particular, now we have IntPoint.newExactQuery(field, value),
> IntPoint.newRangeQuery(field, lowerValue, upperValue), etc.

Thanks, but I likely won't update it ... stuff changes over time ;)
If I spent time updating my old posts I would never get anything else
done!

And the post does state that points are unreleased / subject to change, iirc.

> 2. Note that as in the actual API, those are "values", not "points." In
> fact, the Javadoc says "Create a query for matching an exact integer value"
> and "Create a range query for integer values."
> https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/document/IntPoint.java

Yeah, I guess we sometimes say value for 1D points?

> 3. The class is declared as "class IntPoint extends Field", which feels a bit
> odd without adding any useful info. I mean, why isn't IntPoint extending
> Point? And these really are fields, not points. I'd suggest sticking with
> "IntField". I mean, the Javadoc does say "An indexed {@code int} field."
> Ditto for the other numeric XxxPoint classes.

Well, Field is the base class Lucene uses for "that which you add to a
Document for indexing", so we pretty much have to subclass it.

I don't think we should just "re-use" the previous IntField: that
would be very trappy, the implementation is very different, you can
index 2D points, etc.

> 4. I didn't notice a DatePoint class in the Lucene search package. I'm sure
> it's floating around somewhere, but it does seem odd that it's not... right
> there with Int, Float, Double, et al.

You're right, patches welcome!

> 5. It would help people to speak of a numeric field as a "space" which
> happens to be a 1-dimensional line (redundant there!), so that the value in
> a numeric field is then effectively a "point" in that 1D space. That's if
> we're going to stick with this conception of simple, scalar, numeric fields
> as being "points", but I think it makes more sense to speak of numeric
> fields with dimensionality, like 2D/3D dimensional int/float/double field.
> The n numeric values do happen to correspond to a "point" when n>1, but at
> the API level they seem to be dimensional values. I mean, even for 2D and
> 3D, the Javadoc for Int/Float/DoublePoint.newRangeQuery says "Create a range
> query for n-dimensional integer values."

Not sure what you're saying here.

> 6. Your post refers to "a new feature called dimensional points", but that
> term doesn't seem to be used commonly in the code or Jira (just a couple of
> references, but not in titles.) Besides, it seems redundant - I mean, when
> does a point not have dimensionality? I would suggest renaming that to
> "dimensional values" or dimValues, rather than "points." Or, maybe just
> abstractly as "dimensional fields" to indicate that numeric fields support
> multiple dimensions now. To me, it feels like there should be a
> DimensionalField derived from Field that is used as base for IntField, et
> al, to reinforce the dimensionality and provide a common base in the
> Javadoc, or other places in the code that wish to refer to fields that
> are either dimensional or numeric. Or, maybe it should just be NumericField?

I think "points" (sounds like N dims) is more correct for the general
feature name and its related classes, than "value" (sounds like 1D).

> 7. I see a minor bug in an exception:
>
>     if (lowerPoint.length != upperPoint.length) {
>       throw new IllegalArgumentException("lowerPoint has length=" + numDims
>           + " but upperPoint has different length=" + upperPoint.length);
>     }
>
> numDims should be lowerPoint.length. For a simple Int"Point" (Field!) then
> length would be 4 but numDims would be 1.

Thanks!  I'll go fix that.
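
A minimal sketch of the corrected check, assuming lowerPoint and upperPoint
are packed byte[] values as the report implies (length 4 for a single int).
The class and method names are hypothetical and the parameter types are an
assumption, not copied from Lucene.

```java
/** Sketch of the corrected length check; names and parameter types are assumed. */
final class PointLengthCheck {

  static void checkSameLength(byte[] lowerPoint, byte[] upperPoint) {
    if (lowerPoint.length != upperPoint.length) {
      // Report lowerPoint.length (a byte count), not numDims (a dimension count).
      throw new IllegalArgumentException("lowerPoint has length=" + lowerPoint.length
          + " but upperPoint has different length=" + upperPoint.length);
    }
  }
}
```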

> 8. I was a little disappointed that a point query wasn't a lot faster than
> a trie field. I mean, 25% is decent, but I would have imagined that all of
> this work would have resulted in more like a 400% gain in speed. Is the
> current implementation in master considered optimal or does it have a lot of
> room for improvement? Also, is this for an index primarily cached in OS
> system memory or primarily accessed with I/O? And, I'm curious whether exact
> point and narrow range queries (e.g., trying to select less than 0.25% of
> indexed documents) are indeed only 25% faster than trie.

I would love a 1000% speedup, but it was what it was on that
day that I tested ;)  I'll take 25% and faster indexing, much less
heap, etc.

It's most analogous to postings: primarily IO (sequential, in the 1D
case), so, yeah you want those pages to be hot in the OS's IO cache.

There have already been lots of changes since then, maybe the number
is different now.

Maybe a different benchmark gives different results.  Benchmarks welcome!

And, yes, I'm sure there are improvements still to make.  Various devs
have been doing so intensely for the past few weeks.  Patches welcome!

Mike McCandless

http://blog.mikemccandless.com

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: Lucene FieldType & specifying numeric type (double, float, )

Posted by Jack Krupansky <ja...@gmail.com>.
Mike, thanks for that blog post link. I just read it, and looked at some
code. Thanks to your post I can at least pretend to feel that I know a
little bit about what has been going on! I even know now what BKD refers to
(Block K-D tree), and that it simultaneously is a replacement for Trie
fields and multi-dimensional.

(Please let me know if this discussion should be moved elsewhere, either to
Jira or a fresh thread, although it seems germane to David's original
inquiry, at least a little.)

1. You need to update the post a little, like the change for
ExactPointQuery that occurred on 2/20, a few days after your post:
https://issues.apache.org/jira/browse/LUCENE-7039

In particular, now we have IntPoint.newExactQuery(field, value),
IntPoint.newRangeQuery(field, lowerValue, upperValue), etc.

2. Note that in the actual API, those are "values", not "points." In
fact, the Javadoc says "Create a query for matching an exact integer value"
and "Create a range query for integer values."
https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/document/IntPoint.java

3. The class is declared as "class IntPoint extends Field", which feels a bit
odd without adding any useful info. I mean, why isn't IntPoint extending
Point? And these really are fields, not points. I'd suggest sticking with
"IntField". I mean, the Javadoc does say "An indexed {@code int} field."
Ditto for the other numeric XxxPoint classes.

4. I didn't notice a DatePoint class in the Lucene search package. I'm sure
it's floating around somewhere, but it does seem odd that it's not... right
there with Int, Float, Double, et al.

5. It would help people to speak of a numeric field as a "space" which
happens to be a 1-dimensional line (redundant there!), so that the value in
a numeric field is then effectively a "point" in that 1D space. That's if
we're going to stick with this conception of simple, scalar, numeric fields
as being "points", but I think it makes more sense to speak of numeric
fields with dimensionality, like a 2D/3D int/float/double field.
The n numeric values do happen to correspond to a "point" when n>1, but at
the API level they seem to be dimensional values. I mean, even for 2D and
3D, the Javadoc for Int/Float/DoublePoint.newRangeQuery says "Create a
range query for n-dimensional integer values."

6. Your post refers to "a new feature called *dimensional points
<https://issues.apache.org/jira/browse/LUCENE-6852>*", but that term
doesn't seem to be used commonly in the code or Jira (just a couple of
references, but not in titles.) Besides, it seems redundant - I mean, when
does a point not have dimensionality? I would suggest renaming that to
"dimensional values" or dimValues, rather than "points." Or, maybe just
abstractly as "dimensional fields" to indicate that numeric fields support
multiple dimensions now. To me, it feels like there should be a
DimensionalField derived from Field that is used as base for IntField, et
al, to reinforce the dimensionality and provide a common base in the
Javadoc, or other places in the code that wish to refer to fields that
are either dimensional or numeric. Or, maybe it should just be NumericField?

7. I see a minor bug in an exception:

    if (lowerPoint.length != upperPoint.length) {
      throw new IllegalArgumentException("lowerPoint has length=" + numDims
+ " but upperPoint has different length=" + upperPoint.length);
    }

numDims should be lowerPoint.length. For a simple Int"Point" (Field!) then
length would be 4 but numDims would be 1.

See:
https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/search/PointRangeQuery.java

8. I was a little disappointed that a point query wasn't a lot faster than
trie field. I mean, 25% is decent, but I would have imagined that all of
this work would have resulted in more like a 400% gain in speed. Is the
current implementation in master considered optimal or does it have a lot of
room for improvement? Also, is this for an index primarily cached in OS
system memory or primarily accessed with I/O? And, I'm curious whether
exact point and narrow range queries (e.g., trying to select less than
0.25% of indexed documents) are indeed only 25% faster than trie.
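
To make item 7 concrete, here is a minimal standalone sketch of the corrected check. The class and method names (RangeCheckSketch, checkBounds) are hypothetical and this is not the actual PointRangeQuery code; it only assumes that the two bounds arrive as byte arrays, and shows the message reporting the array length rather than numDims.

```java
// Hypothetical standalone sketch; not the real PointRangeQuery source.
final class RangeCheckSketch {

    /**
     * Rejects bounds whose lengths differ. The message reports
     * lowerPoint.length (the array length), not the number of dimensions.
     */
    static void checkBounds(byte[] lowerPoint, byte[] upperPoint) {
        if (lowerPoint.length != upperPoint.length) {
            throw new IllegalArgumentException(
                "lowerPoint has length=" + lowerPoint.length
                    + " but upperPoint has different length=" + upperPoint.length);
        }
    }
}
```

With a 4-element lower bound and an 8-element upper bound, the message now names both real lengths instead of printing the dimension count for the first one.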

My apologies for my limited depth of comprehension on all of this new work.


-- Jack Krupansky

On Thu, Mar 24, 2016 at 12:51 PM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> See also my recent blog post describing this new feature:
> https://www.elastic.co/blog/lucene-points-6.0
>
> Net/net, in the 1D case, points look like a win across the board vs.
> the legacy (postings) implementation.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Thu, Mar 24, 2016 at 12:33 PM, Robert Muir <rc...@gmail.com> wrote:
> > On Thu, Mar 24, 2016 at 12:16 PM, Joel Bernstein <jo...@gmail.com>
> wrote:
> >> I'm pretty confused about points as well and until very recently thought
> >> these were geospatial improvements only.
> >>
> >> It would be good to understand the mechanics of points versus numerics.
> I'm
> >> particularly interested in not losing the high performance numeric
> DocValues
> >> support, which has become so important for analytics.
> >>
> >
> > Unrelated. points are the structure used to find matching documents
> > from e.g. a query point, range, radius, shape, whatever. They use a
> > tree-like structure for this. So they are the replacement for
> > NumericRangeQuery, which "simulates" a tree with an inverted index.
> >
> > Instead of inverted index+postings list, we just have a proper tree
> > structure for these things: fixed-width, multidimensional values. It
> > has a different indexreader api for example, that lets you control how
> > the tree is traversed as it goes (by returning INSIDE [collect all the
> > docids in here blindly, this entire tree range is relevant], OUTSIDE
> > [not relevant to my query, don't traverse this region anymore], or
> > CROSSES [i may or may not be interested, have to traverse further to
> > nodes (sub-ranges or values themselves)]).
> >
> > They also have the advantage of not being limited to 64 bits or 1
> > dimension, you can have up to 128 bits and up to 8 dimensions. So each
> > thing you are adding to your document is really a "point in
> > n-dimensional space", so if you want to have 3 lat+long pairs as a
> > double[] in a single field, that works as you expect.
> >
> > See more information here:
> >
> https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/index/PointValues.java#L35-L79
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: dev-help@lucene.apache.org
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

Re: Lucene FieldType & specifying numeric type (double, float, )

Posted by Michael McCandless <lu...@mikemccandless.com>.
See also my recent blog post describing this new feature:
https://www.elastic.co/blog/lucene-points-6.0

Net/net, in the 1D case, points look like a win across the board vs.
the legacy (postings) implementation.

Mike McCandless

http://blog.mikemccandless.com


On Thu, Mar 24, 2016 at 12:33 PM, Robert Muir <rc...@gmail.com> wrote:
> On Thu, Mar 24, 2016 at 12:16 PM, Joel Bernstein <jo...@gmail.com> wrote:
>> I'm pretty confused about points as well and until very recently thought
>> these were geospatial improvements only.
>>
>> It would be good to understand the mechanics of points versus numerics. I'm
>> particularly interested in not losing the high performance numeric DocValues
>> support, which has become so important for analytics.
>>
>
> Unrelated. points are the structure used to find matching documents
> from e.g. a query point, range, radius, shape, whatever. They use a
> tree-like structure for this. So they are the replacement for
> NumericRangeQuery, which "simulates" a tree with an inverted index.
>
> Instead of inverted index+postings list, we just have a proper tree
> structure for these things: fixed-width, multidimensional values. It
> has a different indexreader api for example, that lets you control how
> the tree is traversed as it goes (by returning INSIDE [collect all the
> docids in here blindly, this entire tree range is relevant], OUTSIDE
> [not relevant to my query, don't traverse this region anymore], or
> CROSSES [i may or may not be interested, have to traverse further to
> nodes (sub-ranges or values themselves)]).
>
> They also have the advantage of not being limited to 64 bits or 1
> dimension, you can have up to 128 bits and up to 8 dimensions. So each
> thing you are adding to your document is really a "point in
> n-dimensional space", so if you want to have 3 lat+long pairs as a
> double[] in a single field, that works as you expect.
>
> See more information here:
> https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/index/PointValues.java#L35-L79
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>



Re: Lucene FieldType & specifying numeric type (double, float, )

Posted by Robert Muir <rc...@gmail.com>.
On Thu, Mar 24, 2016 at 12:16 PM, Joel Bernstein <jo...@gmail.com> wrote:
> I'm pretty confused about points as well and until very recently thought
> these were geospatial improvements only.
>
> It would be good to understand the mechanics of points versus numerics. I'm
> particularly interested in not losing the high performance numeric DocValues
> support, which has become so important for analytics.
>

Unrelated. points are the structure used to find matching documents
from e.g. a query point, range, radius, shape, whatever. They use a
tree-like structure for this. So they are the replacement for
NumericRangeQuery, which "simulates" a tree with an inverted index.

Instead of inverted index+postings list, we just have a proper tree
structure for these things: fixed-width, multidimensional values. It
has a different indexreader api for example, that lets you control how
the tree is traversed as it goes (by returning INSIDE [collect all the
docids in here blindly, this entire tree range is relevant], OUTSIDE
[not relevant to my query, don't traverse this region anymore], or
CROSSES [i may or may not be interested, have to traverse further to
nodes (sub-ranges or values themselves)]).
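
A toy illustration of that traversal, as a minimal sketch only: it is plain Java over a sorted 1D array, not the Lucene API, and every name in it (TreeWalkSketch, compare, walk, search, LEAF_SIZE) is made up for the example. Each "cell" is a slice of the sorted values; the comparison returns INSIDE, OUTSIDE, or CROSSES, and the walk acts on that answer as described above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy model of relation-driven traversal; not Lucene code.
final class TreeWalkSketch {
    enum Relation { INSIDE, OUTSIDE, CROSSES }

    static final int LEAF_SIZE = 2; // arbitrary small leaf size for the example

    /** How the cell [cellMin, cellMax] relates to the query range [lo, hi]. */
    static Relation compare(int cellMin, int cellMax, int lo, int hi) {
        if (cellMax < lo || cellMin > hi) return Relation.OUTSIDE;
        if (cellMin >= lo && cellMax <= hi) return Relation.INSIDE;
        return Relation.CROSSES;
    }

    /** Visits the slice [from, to) of the sorted values, collecting matching doc ids. */
    static void walk(int[] values, int[] docIds, int from, int to,
                     int lo, int hi, List<Integer> hits) {
        if (from >= to) return;
        Relation r = compare(values[from], values[to - 1], lo, hi);
        if (r == Relation.OUTSIDE) return;            // prune this region
        if (r == Relation.INSIDE) {                   // collect blindly
            for (int i = from; i < to; i++) hits.add(docIds[i]);
            return;
        }
        if (to - from <= LEAF_SIZE) {                 // CROSSES at a leaf: check each value
            for (int i = from; i < to; i++) {
                if (values[i] >= lo && values[i] <= hi) hits.add(docIds[i]);
            }
            return;
        }
        int mid = (from + to) >>> 1;                  // CROSSES above a leaf: recurse
        walk(values, docIds, from, mid, lo, hi, hits);
        walk(values, docIds, mid, to, lo, hi, hits);
    }

    /** Convenience entry point; values must be sorted ascending, docIds parallel to values. */
    static List<Integer> search(int[] values, int[] docIds, int lo, int hi) {
        List<Integer> hits = new ArrayList<>();
        walk(values, docIds, 0, values.length, lo, hi, hits);
        return hits;
    }
}
```

The point of the three-way answer is that whole subtrees are either skipped or collected without looking at individual values; only cells that straddle the query boundary need per-value checks.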

They also have the advantage of not being limited to 64 bits or 1
dimension, you can have up to 128 bits and up to 8 dimensions. So each
thing you are adding to your document is really a "point in
n-dimensional space", so if you want to have 3 lat+long pairs as a
double[] in a single field, that works as you expect.
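
For anyone wondering what "fixed-width, multidimensional values" means at the byte level, here is a minimal sketch, assuming each int dimension is written as 4 big-endian bytes with the sign bit flipped so that unsigned byte order matches numeric order. The names (PointEncodingSketch, encodeInt, pack, compareUnsigned) are hypothetical; I believe Lucene's own utilities follow a similar idea, but treat that as an assumption and check the source.

```java
// Hypothetical illustration of packing int dimensions into one byte[]; not Lucene code.
final class PointEncodingSketch {

    /** Writes value as 4 big-endian bytes with the sign bit flipped. */
    static void encodeInt(int value, byte[] dest, int offset) {
        int v = value ^ 0x80000000; // flip sign bit so negatives sort before positives
        dest[offset] = (byte) (v >>> 24);
        dest[offset + 1] = (byte) (v >>> 16);
        dest[offset + 2] = (byte) (v >>> 8);
        dest[offset + 3] = (byte) v;
    }

    /** Packs one point: dims.length dimensions, 4 bytes each, back to back. */
    static byte[] pack(int... dims) {
        byte[] packed = new byte[dims.length * Integer.BYTES];
        for (int d = 0; d < dims.length; d++) {
            encodeInt(dims[d], packed, d * Integer.BYTES);
        }
        return packed;
    }

    /** Unsigned lexicographic comparison of length bytes starting at offset. */
    static int compareUnsigned(byte[] a, byte[] b, int offset, int length) {
        for (int i = offset; i < offset + length; i++) {
            int cmp = (a[i] & 0xff) - (b[i] & 0xff);
            if (cmp != 0) return cmp;
        }
        return 0;
    }
}
```

So a 1D int point packs to 4 bytes and a 2D int point to 8, which is why the packed array length and the number of dimensions are different quantities.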

See more information here:
https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/index/PointValues.java#L35-L79



Re: Lucene FieldType & specifying numeric type (double, float, )

Posted by Joel Bernstein <jo...@gmail.com>.
I'm pretty confused about points as well and until very recently thought
these were geospatial improvements only.

It would be good to understand the mechanics of points versus numerics. I'm
particularly interested in not losing the high performance numeric
DocValues support, which has become so important for analytics.

Joel Bernstein
http://joelsolr.blogspot.com/

On Thu, Mar 24, 2016 at 11:37 AM, David Smiley <da...@gmail.com>
wrote:

> bq. it wasn't at all clear that the intention was that simple scalars
> would now and forever henceforth be referred to as "points". My impression
> at the time was that the focus of the Jira was on implementation and
> storage level indexing detail rather than the user-facing API level. I see
> now that I was wrong about that. It just seems to me that there should have
> been a more direct public discussion of eliminating the concept of scalar
> values at the API level.
>
> I knew because I was following closely, but otherwise I agree with your
> sentiment.  I don't love the "PointValues" terminology either nor did I
> like "DimensionalValues"; I should have suggested alternatives at the time
> but the Mike & Rob tag-team were working so fast that I didn't interject in
> the narrow window of time before a patch was put up with the current
> names.  More time to publicly discuss would have been better.  FWIW I like
> your suggestion for "Scalar"; that's more meaningful to me.  Naming is hard.
>
> ~ David
>
> On Thu, Mar 24, 2016 at 11:28 AM Jack Krupansky <ja...@gmail.com>
> wrote:
>
>> I wasn't paying close attention when this whole PointValues saga was
>> unfolding. I get the value of points for spatial data, but conflating the
>> terms "point" and "numeric" is bizarre to say the least. Reading the code,
>> I see "Points represent numeric values", which seems nonsensical to me. A
>> little later the code comment says "Geospatial Point Types - Although basic
>> point types such as DoublePoint support points in multi-dimensional space
>> too, Lucene has specialized classes for location data...", which continues
>> this odd use of terminology. I mean, aren't all points spatial by
>> definition, so that "Geospatial Point" is redundant? It would make more
>> sense to speak of a point as a geospatial number, or that a point is
>> represented by numbers.
>>
>> IOW, NumericValues would make more sense as the base, with (spatial)
>> PointValues derived from the base of numeric values. At least to me that
>> would make more sense.
>>
>> As the PointValues work was progressing I had no idea that its intent was to
>> subsume, replace, or deprecate traditional scalar numeric value support in
>> Lucene (or Solr.) It came across primarily as being an improvement for
>> spatial search.
>>
>> Not that I have any objection to greatly improved storage in Lucene, but
>> to now have to speak of all numeric data as points seems quite... weird.
>>
>> Sure, I saw the Jira traffic, like LUCENE-6825 (Add multidimensional
>> byte[] indexing support to Lucene) and LUCENE-6852 (Add DimensionalFormat
>> to Codec), but in all honesty that really did come across as relating to
>> purely spatial data and not being applicable to basic scalar number support.
>>
>> Looking at CHANGES.TXT, I see references like "LUCENE-6852, LUCENE-6975:
>> Add support for points (dimensionally indexed values)", but without any
>> hint that the intent was to subsume or replace non-dimensional numeric
>> indexed values.
>>
>> Now for all I know, non-dimensional (scalar) numeric data can very
>> efficiently be handled as if it had dimension, but that's not exactly
>> obvious and warrants at least some illumination. In traditional terminology
>> a point is 0-dimension (a line is 1-dimension, and a plane is 2-dimension),
>> but traditionally a raw number - a scalar - hasn't been referred to as
>> having dimension, so that is a new concept warranting clear definition.
>>
>> Yeah, I do recall seeing LUCENE-6917 (Deprecate and rename
>> NumericField/RangeQuery to LegacyNumeric) go by in the Jira traffic, and
>> shame on me for not reading the details more carefully, but it wasn't at
>> all clear that the intention was that simple scalars would now and forever
>> henceforth be referred to as "points". My impression at the time was that
>> the focus of the Jira was on implementation and storage level indexing
>> detail rather than the user-facing API level. I see now that I was wrong
>> about that. It just seems to me that there should have been a more direct
>> public discussion of eliminating the concept of scalar values at the API
>> level.
>>
>> (I wonder what physics would be like if they started referring to scalar
>> quantities as vectors.)
>>
>> My apologies for the rant.
>>
>>
>> -- Jack Krupansky
>>
>> On Thu, Mar 24, 2016 at 10:34 AM, David Smiley <da...@gmail.com>
>> wrote:
>>
>>> With the move to PointValues and away from trie based indexing of the
>>> terms index, for numerics, everything associated with the trie stuff seems
>>> to be labelled as "Legacy" and marked deprecated.  Even
>>> FieldType.NumericType (now FieldType.LegacyNumericType) -- a simple enum of
>>> INT, LONG, FLOAT, DOUBLE.  I wonder if we ought to reconsider doing this
>>> for FieldType.NumericType, as it articulates the type of numeric data; it
>>> need not be associated with just trie indexing of terms data; it could
>>> articulate how any numeric data is encoded, be it docValues or
>>> pointValues.  This is useful metadata.  It's not strictly required, true,
>>> but it's useful in describing what goes in the field.  This makes a
>>> FieldType instance fairly self-sufficient.  Otherwise, say you have
>>> docValue numerics and/or pointValues, it's ambiguous how the data should be
>>> interpreted.  This doesn't lead to a bug but would help debugging and
>>> allowing APIs to express field requirements simply by providing a FieldType
>>> instance for numeric data.  It used to be self sufficient but now if we
>>> imagine the legacy stuff being removed, it's ambiguous.  In addition, it
>>> would be useful metadata if it found its way into FieldInfo.  Then Luke,
>>> say, could help you know what's there and maybe search it.
>>>
>>> Thoughts?
>>>
>>> ~ David
>>> --
>>> Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
>>> LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
>>> http://www.solrenterprisesearchserver.com
>>>
>>
>> --
> Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
> LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
> http://www.solrenterprisesearchserver.com
>

Re: Lucene FieldType & specifying numeric type (double, float, )

Posted by David Smiley <da...@gmail.com>.
bq. it wasn't at all clear that the intention was that simple scalars would
now and forever henceforth be referred to as "points". My impression at the
time was that the focus of the Jira was on implementation and storage level
indexing detail rather than the user-facing API level. I see now that I was
wrong about that. It just seems to me that there should have been a more
direct public discussion of eliminating the concept of scalar values at the
API level.

I knew because I was following closely, but otherwise I agree with your
sentiment.  I don't love the "PointValues" terminology either nor did I
like "DimensionalValues"; I should have suggested alternatives at the time
but the Mike & Rob tag-team were working so fast that I didn't interject in
the narrow window of time before a patch was put up with the current
names.  More time to publicly discuss would have been better.  FWIW I like
your suggestion for "Scalar"; that's more meaningful to me.  Naming is hard.

~ David

On Thu, Mar 24, 2016 at 11:28 AM Jack Krupansky <ja...@gmail.com>
wrote:

> I wasn't paying close attention when this whole PointValues saga was
> unfolding. I get the value of points for spatial data, but conflating the
> terms "point" and "numeric" is bizarre to say the least. Reading the code,
> I see "Points represent numeric values", which seems nonsensical to me. A
> little later the code comment says "Geospatial Point Types - Although basic
> point types such as DoublePoint support points in multi-dimensional space
> too, Lucene has specialized classes for location data...", which continues
> this odd use of terminology. I mean, aren't all points spatial by
> definition, so that "Geospatial Point" is redundant? It would make more
> sense to speak of a point as a geospatial number, or that a point is
> represented by numbers.
>
> IOW, NumericValues would make more sense as the base, with (spatial)
> PointValues derived from the base of numeric values. At least to me that
> would make more sense.
>
> As the PointValues work was progressing I had no idea that its intent was to
> subsume, replace, or deprecate traditional scalar numeric value support in
> Lucene (or Solr.) It came across primarily as being an improvement for
> spatial search.
>
> Not that I have any objection to greatly improved storage in Lucene, but
> to now have to speak of all numeric data as points seems quite... weird.
>
> Sure, I saw the Jira traffic, like LUCENE-6825 (Add multidimensional
> byte[] indexing support to Lucene) and LUCENE-6852 (Add DimensionalFormat
> to Codec), but in all honesty that really did come across as relating to
> purely spatial data and not being applicable to basic scalar number support.
>
> Looking at CHANGES.TXT, I see references like "LUCENE-6852, LUCENE-6975:
> Add support for points (dimensionally indexed values)", but without any
> hint that the intent was to subsume or replace non-dimensional numeric
> indexed values.
>
> Now for all I know, non-dimensional (scalar) numeric data can very
> efficiently be handled as if it had dimension, but that's not exactly
> obvious and warrants at least some illumination. In traditional terminology
> a point is 0-dimension (a line is 1-dimension, and a plane is 2-dimension),
> but traditionally a raw number - a scalar - hasn't been referred to as
> having dimension, so that is a new concept warranting clear definition.
>
> Yeah, I do recall seeing LUCENE-6917 (Deprecate and rename
> NumericField/RangeQuery to LegacyNumeric) go by in the Jira traffic, and
> shame on me for not reading the details more carefully, but it wasn't at
> all clear that the intention was that simple scalars would now and forever
> henceforth be referred to as "points". My impression at the time was that
> the focus of the Jira was on implementation and storage level indexing
> detail rather than the user-facing API level. I see now that I was wrong
> about that. It just seems to me that there should have been a more direct
> public discussion of eliminating the concept of scalar values at the API
> level.
>
> (I wonder what physics would be like if they started referring to scalar
> quantities as vectors.)
>
> My apologies for the rant.
>
>
> -- Jack Krupansky
>
> On Thu, Mar 24, 2016 at 10:34 AM, David Smiley <da...@gmail.com>
> wrote:
>
>> With the move to PointValues and away from trie based indexing of the
>> terms index, for numerics, everything associated with the trie stuff seems
>> to be labelled as "Legacy" and marked deprecated.  Even
>> FieldType.NumericType (now FieldType.LegacyNumericType) -- a simple enum of
>> INT, LONG, FLOAT, DOUBLE.  I wonder if we ought to reconsider doing this
>> for FieldType.NumericType, as it articulates the type of numeric data; it
>> need not be associated with just trie indexing of terms data; it could
>> articulate how any numeric data is encoded, be it docValues or
>> pointValues.  This is useful metadata.  It's not strictly required, true,
>> but it's useful in describing what goes in the field.  This makes a
>> FieldType instance fairly self-sufficient.  Otherwise, say you have
>> docValue numerics and/or pointValues, it's ambiguous how the data should be
>> interpreted.  This doesn't lead to a bug but would help debugging and
>> allowing APIs to express field requirements simply by providing a FieldType
>> instance for numeric data.  It used to be self sufficient but now if we
>> imagine the legacy stuff being removed, it's ambiguous.  In addition, it
>> would be useful metadata if it found its way into FieldInfo.  Then Luke,
>> say, could help you know what's there and maybe search it.
>>
>> Thoughts?
>>
>> ~ David
>> --
>> Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
>> LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
>> http://www.solrenterprisesearchserver.com
>>
>
> --
Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
http://www.solrenterprisesearchserver.com

Re: Lucene FieldType & specifying numeric type (double, float, )

Posted by Jack Krupansky <ja...@gmail.com>.
I wasn't paying close attention when this whole PointValues saga was
unfolding. I get the value of points for spatial data, but conflating the
terms "point" and "numeric" is bizarre to say the least. Reading the code,
I see "Points represent numeric values", which seems nonsensical to me. A
little later the code comment says "Geospatial Point Types - Although basic
point types such as DoublePoint support points in multi-dimensional space
too, Lucene has specialized classes for location data...", which continues
this odd use of terminology. I mean, aren't all points spatial by
definition, so that "Geospatial Point" is redundant? It would make more
sense to speak of a point as a geospatial number, or that a point is
represented by numbers.

IOW, NumericValues would make more sense as the base, with (spatial)
PointValues derived from the base of numeric values. At least to me that
would make more sense.

As the PointValues work was progressing I had no idea that its intent was to
subsume, replace, or deprecate traditional scalar numeric value support in
Lucene (or Solr.) It came across primarily as being an improvement for
spatial search.

Not that I have any objection to greatly improved storage in Lucene, but to
now have to speak of all numeric data as points seems quite... weird.

Sure, I saw the Jira traffic, like LUCENE-6825 (Add multidimensional byte[]
indexing support to Lucene) and LUCENE-6852 (Add DimensionalFormat to
Codec), but in all honesty that really did come across as relating to
purely spatial data and not being applicable to basic scalar number support.

Looking at CHANGES.TXT, I see references like "LUCENE-6852, LUCENE-6975:
Add support for points (dimensionally indexed values)", but without any
hint that the intent was to subsume or replace non-dimensional numeric
indexed values.

Now for all I know, non-dimensional (scalar) numeric data can very
efficiently be handled as if it had dimension, but that's not exactly
obvious and warrants at least some illumination. In traditional terminology
a point is 0-dimension (a line is 1-dimension, and a plane is 2-dimension),
but traditionally a raw number - a scalar - hasn't been referred to as
having dimension, so that is a new concept warranting clear definition.

Yeah, I do recall seeing LUCENE-6917 (Deprecate and rename
NumericField/RangeQuery to LegacyNumeric) go by in the Jira traffic, and
shame on me for not reading the details more carefully, but it wasn't at
all clear that the intention was that simple scalars would now and forever
henceforth be referred to as "points". My impression at the time was that
the focus of the Jira was on implementation and storage level indexing
detail rather than the user-facing API level. I see now that I was wrong
about that. It just seems to me that there should have been a more direct
public discussion of eliminating the concept of scalar values at the API
level.

(I wonder what physics would be like if they started referring to scalar
quantities as vectors.)

My apologies for the rant.


-- Jack Krupansky

On Thu, Mar 24, 2016 at 10:34 AM, David Smiley <da...@gmail.com>
wrote:

> With the move to PointValues and away from trie based indexing of the
> terms index, for numerics, everything associated with the trie stuff seems
> to be labelled as "Legacy" and marked deprecated.  Even
> FieldType.NumericType (now FieldType.LegacyNumericType) -- a simple enum of
> INT, LONG, FLOAT, DOUBLE.  I wonder if we ought to reconsider doing this
> for FieldType.NumericType, as it articulates the type of numeric data; it
> need not be associated with just trie indexing of terms data; it could
> articulate how any numeric data is encoded, be it docValues or
> pointValues.  This is useful metadata.  It's not strictly required, true,
> but it's useful in describing what goes in the field.  This makes a
> FieldType instance fairly self-sufficient.  Otherwise, say you have
> docValue numerics and/or pointValues, it's ambiguous how the data should be
> interpreted.  This doesn't lead to a bug but would help debugging and
> allowing APIs to express field requirements simply by providing a FieldType
> instance for numeric data.  It used to be self sufficient but now if we
> imagine the legacy stuff being removed, it's ambiguous.  In addition, it
> would be useful metadata if it found its way into FieldInfo.  Then Luke,
> say, could help you know what's there and maybe search it.
>
> Thoughts?
>
> ~ David
> --
> Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
> LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
> http://www.solrenterprisesearchserver.com
>