Posted to dev@lucene.apache.org by Toke Eskildsen <te...@statsbiblioteket.dk> on 2014/12/10 15:12:24 UTC

Determining NumericType for a field

I am attempting to write some code for removing or adding DocValues for
an existing Lucene index: https://github.com/netarchivesuite/dvenabler
I have a proof of concept running, but it is not very user friendly.

Ideally the user should be presented with a list of fields and simply
select which ones should have DocValues. However, in order to do so, I
need to determine if a NumericField was indexed as INT, LONG, FLOAT or
DOUBLE.

That information is present in FieldType at index time, but I cannot
figure out whether it is possible to extract it from an existing index.
If it is not possible to determine with certainty, I could use a way of
performing a best guess.
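
A minimal sketch of one possible best guess, under the assumption that the
field is also stored: as far as I know, stored numeric fields keep the Java
type of their value, so the Number returned by IndexableField.numericValue()
for a few sampled documents could be inspected. The sketch leaves Lucene out
and works on plain Number samples; NumericTypeGuesser, Guess and guess() are
hypothetical names, not Lucene API. It does not help for fields that are
indexed but not stored.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper, not part of Lucene: guesses a numeric type from sampled stored values. */
final class NumericTypeGuesser {
    /** The four numeric kinds from the question, plus UNKNOWN when no clear answer exists. */
    enum Guess { INT, LONG, FLOAT, DOUBLE, UNKNOWN }

    /** Maps a single sampled value to a guess based on its Java class. */
    static Guess classify(Number value) {
        if (value instanceof Integer) return Guess.INT;
        if (value instanceof Long) return Guess.LONG;
        if (value instanceof Float) return Guess.FLOAT;
        if (value instanceof Double) return Guess.DOUBLE;
        return Guess.UNKNOWN;
    }

    /** Returns the type shared by all samples, or UNKNOWN if there are none or they disagree. */
    static Guess guess(List<? extends Number> samples) {
        Guess result = null;
        for (Number sample : samples) {
            Guess current = classify(sample);
            if (result == null) {
                result = current;
            } else if (result != current) {
                return Guess.UNKNOWN;
            }
        }
        return result == null ? Guess.UNKNOWN : result;
    }

    public static void main(String[] args) {
        System.out.println(guess(Arrays.asList(1L, 42L, 7L)));      // prints LONG
        System.out.println(guess(Arrays.<Number>asList(1, 2.5f)));  // prints UNKNOWN
    }
}
```

Returning UNKNOWN on any disagreement keeps the guess conservative, so the
user can still be asked to choose explicitly in that case.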

On a similar note, does Lucene have a concept of single and multi-value
stored fields or do I have to infer that by iterating all the documents
and checking each one?
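
The brute-force inference could look roughly like the sketch below. It is
an illustration only: each document is modelled as a plain map from field
name to its stored values, standing in for what Document.getValues(name)
would return per document in Lucene, and MultiValueScan and
multiValuedFields() are hypothetical names.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper: finds fields with more than one stored value in some document. */
final class MultiValueScan {
    /**
     * Iterates all documents and collects the names of fields that hold
     * more than one value in at least one document.
     */
    static Set<String> multiValuedFields(Iterable<Map<String, List<String>>> documents) {
        Set<String> multi = new TreeSet<>();
        for (Map<String, List<String>> doc : documents) {
            for (Map.Entry<String, List<String>> field : doc.entrySet()) {
                if (field.getValue().size() > 1) {
                    multi.add(field.getKey());
                }
            }
        }
        return multi;
    }

    public static void main(String[] args) {
        Map<String, List<String>> first = new HashMap<>();
        first.put("title", Arrays.asList("a"));
        first.put("tag", Arrays.asList("x", "y"));
        Map<String, List<String>> second = new HashMap<>();
        second.put("title", Arrays.asList("b"));
        System.out.println(multiValuedFields(Arrays.asList(first, second))); // prints [tag]
    }
}
```

A field that happens to hold a single value in every document is reported
as single-valued here, even if it was intended to be multi-valued.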

- Toke Eskildsen, State and University Library, Denmark



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: Determining NumericType for a field

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
On Mon, 2014-12-15 at 14:23 +0100, david.w.smiley@gmail.com wrote:

Toke:
> Down to practicalities, we need Lucene 4.8 as our DocValues are Disk
> based and that support was removed in 4.9.

> I assume you’re referring to the “Disk” DV format/Codec?  The standard
> format has the data on disk too, it’s just that there’s some
> “small” (relative to the disk data) lookup references in heap/memory
> whereas the codec you’re using doesn’t.  Are you sure the standard
> codec isn’t sufficient?

As we have not tried anything other than "Disk" for our Net Archive
index, we have no comparison with "standard" (or whatever it is called).
We have no real preference and our next shards will be built with
"standard". The only reason for "Disk" is that it seemed like a good idea
at the time and now we have 20TB of index with it.

We would like to convert away from "Disk" too, but we would like to
avoid having to do a two-pass upgrade ("Disk" -> "standard" followed by
"non-DV" -> "DV"), so the DVEnabling code should preferably support
"Disk" for reading and do it all as single-pass.

>   If your use-case shows that there’s a need for the disk codec, I
> think it could be brought back, perhaps into the codecs module.

I think the removal of "Disk" during a minor version increase was not in
line with the backwards compatibility spirit of Solr. But I am sure it
was marked "Experimental" somewhere in the code and that the removal
obeyed the stated rules.

Anyway, what is done is done, and we have no future need for "Disk". But
thanks for the suggested fix.

>   You could copy the code too to use newer Lucene versions…

We looked at that some time back and the code tentacles reached too far
for us to dare grapple with.

Regards,
Toke Eskildsen, State and University Library, Denmark






Re: Determining NumericType for a field

Posted by "david.w.smiley@gmail.com" <da...@gmail.com>.
> Down to practicalities, we need Lucene 4.8 as our DocValues are Disk
> based and that support was removed in 4.9.


I assume you’re referring to the “Disk” DV format/Codec?  The standard
format has the data on disk too, it’s just that there’s some “small”
(relative to the disk data) lookup references in heap/memory whereas the
codec you’re using doesn’t.  Are you sure the standard codec isn’t
sufficient?  If your use-case shows that there’s a need for the disk codec,
I think it could be brought back, perhaps into the codecs module.  You
could also copy the code and use it with newer Lucene versions… although I
recall some push vs pull API changes so I don’t know what it would take to
bring it up to date.  I’m curious what Rob Muir says about this.

~ David

Re: Determining NumericType for a field

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
On Mon, 2014-12-15 at 11:33 +0100, Michael McCandless wrote:
> On Mon, Dec 15, 2014 at 4:53 AM, Toke Eskildsen <te...@statsbiblioteket.dk> wrote:

[Toke: Limit on faceting with many references]

> Hmm that's probably the DocTermOrds 16 MB internal addressing limit?

Yes, we've hit that one before. If we did not have DocValues, I would
consider it a serious deficiency of Solr.

For one of the fields in the shard I tested, we had 675M references from
256M documents to 3M unique values, with the most popular value having
18M references.

(all of which works perfectly fine & fast with DocValues, yay!)

[2 days for conversion of 900GB index]

> That's awful.  Profile it?  But, how long did it take to index in the
> first place?

A full index build takes 8 days with 24 CPUs going full tilt, roughly 192
CPU days. Conversion is (sadly) single threaded, so measured in total CPU
time, it is just the 2 days. Still, we can't scale parallel conversions
of multiple shards very high due to limited local storage space.
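
Spelled out as a small calculation, with the assumption that all 24 CPUs
are fully busy during indexing and that conversion keeps exactly one CPU
busy (CpuDays and cpuDays() are hypothetical names used for illustration):

```java
/** Hypothetical helper for comparing wall-clock time with total CPU time. */
final class CpuDays {
    /** Total CPU time in days when busyCpus CPUs are fully loaded for wallClockDays. */
    static long cpuDays(long wallClockDays, long busyCpus) {
        return wallClockDays * busyCpus;
    }

    public static void main(String[] args) {
        long indexing = cpuDays(8, 24);   // 8 days on 24 CPUs = 192 CPU days
        long conversion = cpuDays(2, 1);  // 2 days on 1 CPU = 2 CPU days
        System.out.println("indexing:   " + indexing + " CPU days");
        System.out.println("conversion: " + conversion + " CPU days");
        System.out.println("ratio:      " + (indexing / conversion)); // prints 96
    }
}
```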

I'll put a lot more timing debug logging into the code to investigate
where the time is spent.

[TestDemoParallelLeafReader]

> The DVs can be arbitrary (not just long); it's only that the test
> cases focus on long.

My point was that there does not seem to be any auto-guessing of field
type (especially NumericType for numeric values) in the code. Since
guessing would not guarantee correct results anyway, it seems better to
require the user to be specific about what should happen.

> Have a look @ the LUCENE-6005 branch: I broke this test out as a
> separate ReindexingReader + test.  I think we could do a better
> integration between that and the schema...

Down to practicalities, we need Lucene 4.8 as our DocValues are Disk
based and that support was removed in 4.9. I hope to find the time to
look at your better solution in January.

Regards,
Toke Eskildsen, State and University Library, Denmark





Re: Determining NumericType for a field

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Mon, Dec 15, 2014 at 4:53 AM, Toke Eskildsen <te...@statsbiblioteket.dk> wrote:

>> In the meantime, maybe you could model your tool after
>> UninvertingReader?  It faces the same issue (lack of schema) and lets
>> the user specify the type.
>
> Yes, that is what we're doing. Unfortunately we cannot use the
> UninvertingReader directly due to its restrictions on facet structure
> size: We have too many references in our shards so it hits an internal
> 16M(?) limit.

Hmm that's probably the DocTermOrds 16 MB internal addressing limit?

> Unfortunately our current mapping code from stored multi value String to
> DocValues seems to be very slow: It took nearly 2 days to convert a
> single-segment 900GB index, whereas a standard optimize takes only 8 hours.

That's awful.  Profile it?  But, how long did it take to index in the
first place?

>> Also, see (the confusingly named) TestDemoParallelLeafReader?  It lets
>> you partially reindex, e.g. derive new indexed fields or DV fields,
>> etc., from existing stored/DV fields, in an NRT manner.
>
> Thanks for the pointer. As far as I can see, the demo is very explicit
> about the type of DocValues being long, so no auto-guessing there. It's
> a very interesting idea though, with seamless DV-enabling.

The DVs can be arbitrary (not just long); it's only that the test
cases focus on long.

Have a look @ the LUCENE-6005 branch: I broke this test out as a
separate ReindexingReader + test.  I think we could do a better
integration between that and the schema...

I also added a simpler "testSwitchToDocValues" test case.  It still
uses only long DVs but you can easily see how you could do other types
too ... I'll add an example of SortedSet.

Mike McCandless

http://blog.mikemccandless.com



Re: Determining NumericType for a field

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
On Wed, 2014-12-10 at 15:27 +0100, Michael McCandless wrote:
> No, Lucene does not store numeric type nor multi-valued-ness today;
> it's frustrating.

At least I now know not to dig too deep for non-existing answers,
thanks. Our current code requires the user to be explicit about how the
content of the fields should be treated. Until a more fundamental
change, such as LUCENE-6005, we will leave it at that.

> In the meantime, maybe you could model your tool after
> UninvertingReader?  It faces the same issue (lack of schema) and lets
> the user specify the type.

Yes, that is what we're doing. Unfortunately we cannot use the
UninvertingReader directly due to its restrictions on facet structure
size: We have too many references in our shards so it hits an internal
16M(?) limit. 

Unfortunately our current mapping code from stored multi value String to
DocValues seems to be very slow: It took nearly 2 days to convert a
single-segment 900GB index, whereas a standard optimize takes only 8 hours.

> Also, see (the confusingly named) TestDemoParallelLeafReader?  It lets
> you partially reindex, e.g. derive new indexed fields or DV fields,
> etc., from existing stored/DV fields, in an NRT manner.

Thanks for the pointer. As far as I can see, the demo is very explicit
about the type of DocValues being long, so no auto-guessing there. It's
a very interesting idea though, with seamless DV-enabling.

Thank you,
Toke Eskildsen, State and University Library, Denmark





Re: Determining NumericType for a field

Posted by Michael McCandless <lu...@mikemccandless.com>.
No, Lucene does not store numeric type nor multi-valued-ness today;
it's frustrating.

In the LUCENE-6005 branch I'm exploring fixing that, and it's going well,
but there are many challenges/nocommits.

In the meantime, maybe you could model your tool after
UninvertingReader?  It faces the same issue (lack of schema) and lets
the user specify the type.
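
As a rough illustration of "let the user specify the type": as far as I
recall, UninvertingReader is handed a map from field name to a type enum.
The sketch below mimics that idea without depending on Lucene. It parses a
user-supplied string such as "size=LONG,tags=SORTED_SET" into such a map;
FieldTypeSpec, DvType and parse() are hypothetical names and the
"field=TYPE" syntax is an assumption made for this example.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper: turns a user-supplied "field=TYPE,..." string into a field-to-type map. */
final class FieldTypeSpec {
    /** Stand-in for a real type enum; the constants are chosen for illustration only. */
    enum DvType { INT, LONG, FLOAT, DOUBLE, SORTED, SORTED_SET }

    /**
     * Parses entries of the form field=TYPE separated by commas.
     * Throws IllegalArgumentException on malformed entries or unknown type names.
     */
    static Map<String, DvType> parse(String spec) {
        Map<String, DvType> types = new LinkedHashMap<>();
        for (String entry : spec.split(",")) {
            String trimmed = entry.trim();
            if (trimmed.isEmpty()) {
                continue; // tolerate empty entries such as a trailing comma
            }
            int eq = trimmed.indexOf('=');
            if (eq <= 0 || eq == trimmed.length() - 1) {
                throw new IllegalArgumentException("Expected field=TYPE but got '" + trimmed + "'");
            }
            String field = trimmed.substring(0, eq).trim();
            String typeName = trimmed.substring(eq + 1).trim().toUpperCase(Locale.ROOT);
            types.put(field, DvType.valueOf(typeName)); // rejects unknown type names
        }
        return types;
    }

    public static void main(String[] args) {
        System.out.println(parse("size=LONG, tags=sorted_set")); // prints {size=LONG, tags=SORTED_SET}
    }
}
```

Failing fast on an unknown type name keeps the tool from silently writing
DocValues of the wrong kind.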

Also, see (the confusingly named) TestDemoParallelLeafReader?  It lets
you partially reindex, e.g. derive new indexed fields or DV fields,
etc., from existing stored/DV fields, in an NRT manner.



Mike McCandless

http://blog.mikemccandless.com


On Wed, Dec 10, 2014 at 9:12 AM, Toke Eskildsen <te...@statsbiblioteket.dk> wrote:
> I am attempting to write some code for removing or adding DocValues for
> an existing Lucene index: https://github.com/netarchivesuite/dvenabler
> I have a proof of concept running, but it is not very user friendly.
>
> Ideally the user should be presented with a list of fields and simply
> select which ones should have DocValues. However, in order to do so, I
> need to determine if a NumericField was indexed as INT, LONG, FLOAT or
> DOUBLE.
>
> That information is present in FieldType at index time, but I cannot
> figure out if it is possible to extract it from an existing index?
> If it is not possible to determine with certainty, I could use a way of
> performing a best-guess.
>
> On a similar note, does Lucene have a concept of single and multi-value
> stored fields or do I have to infer that by iterating all the documents
> and checking each one?
>
> - Toke Eskildsen, State and University Library, Denmark