Posted to user@lucenenet.apache.org by Artem Chereisky <a....@gmail.com> on 2010/03/12 05:21:35 UTC

2.9 numeric fields and index size

G'day,

Guys, is it expected that numeric fields take more room on the disk? That is
certainly what I am observing. I had 20 fields per document in my index.
Half of those used to be integers stored as text. As part of moving to 2.9 I
replaced those with Numeric Fields

var intField = new NumericField(fieldName,
NumericUtils.PRECISION_STEP_DEFAULT, Field.Store.YES, true);
intField.SetIntValue(intVal);

I noticed that my index size grew by about 45%. Is that normal?

Regards,
Art

Re: 2.9 numeric fields and index size

Posted by Artem Chereisky <a....@gmail.com>.
Michael, Digy, thanks a lot for helping understand Lucene numerics.
My decision is to go back to strings as I only index integers and search for
single terms.

Art

On Sat, Mar 13, 2010 at 8:29 AM, Michael Garski <mg...@myspace-inc.com> wrote:

> Indexing as a string is fine for sorting and for queries on a specific
> term in the index. Numeric fields show their strength in range queries
> over numeric values such as dates and geospatial coordinates.
>
> It's a trade-off between the number of terms in the index and the number
> of clauses in the query/filter. Numeric fields give you more terms in the
> index but fewer terms to match when building a range filter. String fields
> give you fewer terms in the index, but range queries/filters can easily
> expand into so many clauses that performance suffers.
>
> What type of field you choose is dependent on how you want to use it at
> query time.
>
> Michael
>
> -----Original Message-----
> From: Digy [mailto:digydigy@gmail.com]
> Sent: Friday, March 12, 2010 1:22 PM
> To: lucene-net-user@lucene.apache.org
> Subject: RE: 2.9 numeric fields and index size
>
> I've always found it easier to index data as strings whenever possible.
> The same goes for numeric data; I mostly pad the values with '0's and
> index them as text. In the end, Lucene is a full-"text" search engine.
>
> DIGY.
>
> -----Original Message-----
> From: Michael Garski [mailto:mgarski@myspace-inc.com]
> Sent: Friday, March 12, 2010 10:56 PM
> To: lucene-net-user@lucene.apache.org
> Subject: RE: 2.9 numeric fields and index size
>
> Artem,
>
> The use of numeric fields will result in a larger index as what was once
> a single term for a number will now be expanded into multiple terms.
> The number of terms depends on the precision step of the field in
> question, and an explanation of how to calculate the number of terms
> that will be indexed can be found here:
>
> http://lucene.apache.org/java/2_9_2/api/all/org/apache/lucene/search/NumericRangeQuery.html
>
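> Note that the figures quoted there (465 for a precision step of 4, 189
> for a step of 2) are upper bounds on the number of terms a range query
> over a 64-bit value has to visit, not the number of terms indexed per
> value. At index time each value produces roughly bitsPerValue /
> precisionStep terms, so a 32-bit integer with a precision step of 4 is
> indexed as 8 terms.
>
> If you don't need to do range queries on the fields, you can either
> index them as strings or index them as numeric fields with a precision
> step of int.MaxValue (Integer.MAX_VALUE in Java), which gives you one
> term per number. The advantage of indexing as a number is that the
> terms sort in numerical order, because a numeric field encodes the
> value as a sortable string.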
>
> If you are performing range queries/filters along with your search then
> you certainly want to experiment with the numeric fields and determine
> the precision step value that gives you the best performance for the
> data you are working with.
>
> Michael
>
>
> -----Original Message-----
> From: Artem Chereisky [mailto:a.chereisky@gmail.com]
> Sent: Thursday, March 11, 2010 8:22 PM
> To: lucene-net-user@lucene.apache.org
> Subject: 2.9 numeric fields and index size
>
> G'day,
>
> Guys, is it expected that numeric fields take more room on the disk? That
> is certainly what I am observing. I had 20 fields per document in my index.
> Half of those used to be integers stored as text. As part of moving to 2.9
> I replaced those with Numeric Fields
>
> var intField = new NumericField(fieldName,
> NumericUtils.PRECISION_STEP_DEFAULT, Field.Store.YES, true);
> intField.SetIntValue(intVal);
>
> I noticed that my index size grew by about 45%. Is that normal?
>
> Regards,
> Art
>
>
>

RE: 2.9 numeric fields and index size

Posted by Michael Garski <mg...@myspace-inc.com>.
Indexing as a string is fine for sorting and for queries on a specific term in the index. Numeric fields show their strength in range queries over numeric values such as dates and geospatial coordinates.

It's a trade-off between the number of terms in the index and the number of clauses in the query/filter. Numeric fields give you more terms in the index but fewer terms to match when building a range filter. String fields give you fewer terms in the index, but range queries/filters can easily expand into so many clauses that performance suffers.
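To make the trade-off concrete, here is a minimal C# sketch that counts how many prefix terms a trie-style range needs. It follows the range-splitting idea used by numeric fields as I understand it, simplified to non-negative values. It is not Lucene.Net's actual code, and the class and method names (TrieRangeSketch, CountRangeTerms) are made up for illustration.

```csharp
using System;

static class TrieRangeSketch
{
    // Counts the prefix terms needed to cover [lo, hi] (inclusive), assuming
    // each value is indexed once per precision level (shift 0, step, 2*step, ...).
    // Assumption: values are non-negative and fit in valSize bits.
    public static int CountRangeTerms(long lo, long hi, int precisionStep, int valSize = 32)
    {
        if (lo < 0 || hi < lo) throw new ArgumentException("expected 0 <= lo <= hi");
        if (precisionStep < 1) throw new ArgumentOutOfRangeException(nameof(precisionStep));

        int count = 0;
        for (int shift = 0; ; shift += precisionStep)
        {
            long diff = 1L << (shift + precisionStep);
            long mask = ((1L << precisionStep) - 1L) << shift;

            // Does the lower/upper bound stick out of a full block at this level?
            bool hasLower = (lo & mask) != 0L;
            bool hasUpper = (hi & mask) != mask;
            long nextLo = (hasLower ? lo + diff : lo) & ~mask;
            long nextHi = (hasUpper ? hi - diff : hi) & ~mask;

            // Last level, or nothing left in the middle: cover the rest here.
            if (shift + precisionStep >= valSize || nextLo > nextHi)
            {
                count += (int)((hi >> shift) - (lo >> shift) + 1);
                break;
            }

            // Ragged edges are matched at the current (finer) precision...
            if (hasLower) count += (int)(((lo | mask) >> shift) - (lo >> shift) + 1);
            if (hasUpper) count += (int)((hi >> shift) - ((hi & ~mask) >> shift) + 1);

            // ...and the aligned middle moves up to the next, coarser level.
            lo = nextLo;
            hi = nextHi;
        }
        return count;
    }
}
```

For the range 1000 to 50000 at a precision step of 4, this sketch counts 41 prefix terms. A field with one term per value could need up to 49,001 clauses for the same range if every value in it occurs in the index.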

What type of field you choose is dependent on how you want to use it at query time.

Michael




RE: 2.9 numeric fields and index size

Posted by Digy <di...@gmail.com>.
I've always found it easier to index data as strings whenever possible.
The same goes for numeric data; I mostly pad the values with '0's and
index them as text. In the end, Lucene is a full-"text" search engine.
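A minimal C# sketch of the zero-padding approach, assuming non-negative 32-bit integers and a fixed width of 10 digits. The PaddedNumbers helper is hypothetical and not part of Lucene.Net; negative values would need a different scheme.

```csharp
using System;
using System.Globalization;

static class PaddedNumbers
{
    // Width 10 fits every non-negative 32-bit int (int.MaxValue has 10 digits).
    // With a fixed width, ordinal string order matches numeric order.
    public static string Pad(int value)
    {
        if (value < 0)
            throw new ArgumentOutOfRangeException(nameof(value), "negative values need a different scheme");
        return value.ToString("D10", CultureInfo.InvariantCulture);
    }
}
```

Without padding, "9" sorts after "10" in an ordinal comparison; with padding, Pad(9) is "0000000009" and sorts before Pad(10), which is "0000000010".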

DIGY.



RE: 2.9 numeric fields and index size

Posted by Michael Garski <mg...@myspace-inc.com>.
Artem,

The use of numeric fields will result in a larger index as what was once
a single term for a number will now be expanded into multiple terms.
The number of terms depends on the precision step of the field in
question, and an explanation of how to calculate the number of terms
that will be indexed can be found here:

http://lucene.apache.org/java/2_9_2/api/all/org/apache/lucene/search/NumericRangeQuery.html

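Note that the figures quoted there (465 for a precision step of 4, 189
for a step of 2) are upper bounds on the number of terms a range query
over a 64-bit value has to visit, not the number of terms indexed per
value. At index time each value produces roughly bitsPerValue /
precisionStep terms, so a 32-bit integer with a precision step of 4 is
indexed as 8 terms.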
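To make the arithmetic concrete, here is a small C# sketch that computes two different quantities: the number of terms written per value at index time, and the upper bound on terms a range query visits, using the formula from the NumericRangeQuery documentation linked above (that formula is where 465 and 189 come from, for 64-bit values). NumericTermMath and its method names are hypothetical helpers, not part of Lucene.Net.

```csharp
using System;

static class NumericTermMath
{
    // Terms indexed for ONE value: one per precision level, ceil(bits / step).
    public static int TermsPerValue(int bitsPerValue, int precisionStep)
    {
        if (precisionStep < 1) throw new ArgumentOutOfRangeException(nameof(precisionStep));
        if (precisionStep >= bitsPerValue) return 1; // single full-precision term
        return (bitsPerValue + precisionStep - 1) / precisionStep;
    }

    // Upper bound on terms visited by a range query:
    //   (bits/step - 1) * (2^step - 1) * 2 + (2^step - 1)
    // Assumes precisionStep divides bitsPerValue evenly.
    public static long MaxRangeQueryTerms(int bitsPerValue, int precisionStep)
    {
        long perLevel = (1L << precisionStep) - 1;
        return (bitsPerValue / precisionStep - 1) * perLevel * 2 + perLevel;
    }
}
```

TermsPerValue(32, 4) is 8 and TermsPerValue(64, 4) is 16, while MaxRangeQueryTerms(64, 4) is 465 and MaxRangeQueryTerms(64, 2) is 189. Ten integer fields at the default step therefore add on the order of 80 indexed terms per document instead of 10, which is consistent with a noticeably larger index.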

If you don't need to do range queries on the fields, you can either
index them as strings or index them as numeric fields with a precision
step of int.MaxValue (Integer.MAX_VALUE in Java), which gives you one
term per number. The advantage of indexing as a number is that the
terms sort in numerical order, because a numeric field encodes the
value as a sortable string.
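The idea behind a sortable encoding can be sketched in a few lines of C#. Flipping the sign bit maps signed integers onto unsigned values in the same order, and a fixed-width rendering then compares correctly as a string. As far as I understand, numeric fields rely on the same sign-bit trick, but the hex key below is only an illustration and is not the prefix-coded format Lucene actually writes. SortableBits and its members are hypothetical names.

```csharp
using System;

static class SortableBits
{
    // Flipping the sign bit maps int.MinValue..int.MaxValue onto
    // 0..uint.MaxValue while preserving numeric order.
    public static uint ToSortable(int value) => unchecked((uint)value ^ 0x80000000u);

    // Inverse of ToSortable.
    public static int FromSortable(uint bits) => unchecked((int)(bits ^ 0x80000000u));

    // Fixed-width hex key: ordinal string comparison matches numeric order,
    // including for negative values.
    public static string ToSortableKey(int value) => ToSortable(value).ToString("X8");
}
```

For example, ToSortableKey(-1) is "7FFFFFFF", ToSortableKey(0) is "80000000" and ToSortableKey(1) is "80000001", so the keys sort in the same order as the numbers.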

If you are performing range queries/filters along with your search then
you certainly want to experiment with the numeric fields and determine
the precision step value that gives you the best performance for the
data you are working with.

Michael

