Posted to dev@lucene.apache.org by Chris Hostetter <ho...@rescomp.berkeley.edu> on 2005/10/07 23:18:10 UTC

Eliminating norms ... completely

Yonik and I have been looking at the memory requirements of an application
we've got.  We use a lot of indexed fields, primarily so I can do a lot
of numeric tests (using RangeFilter).  When I say "a lot" I mean
around 8,000 -- many of which are not used by all documents in the index.

Now there are some basic usage changes I can make to cut this number in
half, and some more complex biz rule changes I can make to get the number
down some more (at the expense of flexibility) but even then we'd have
around 1,000 -- which is still a lot more than the recommended "handful".

After discussing some options, I asked the question "Remind me again why
having lots of indexed fields makes the memory requirements jump up --
even if only a few documents use some field?" and Yonik reminded me about
the norms[] -- an array of bytes representing the field boost + length
boost for each document.  One of these arrays exists for every indexed
field.
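
To put a rough number on this: if there is one norm byte per document for
every indexed field, the cost grows as fields times documents, no matter how
few documents actually use a field.  A minimal sketch of that arithmetic
follows.  The index size of 1,000,000 documents is a hypothetical figure (no
document count is given in this thread), NormsFootprint is an illustrative
name, and the total only applies if every field's norms end up in memory.

```java
public class NormsFootprint {
    // Assumption from the description above: one norm byte per document
    // for every indexed field, whether or not the document uses the field.
    public static long normBytes(long indexedFields, long maxDoc) {
        return indexedFields * maxDoc;
    }

    public static void main(String[] args) {
        long maxDoc = 1_000_000L; // hypothetical index size, not from this thread
        long[] fieldCounts = {8_000L, 4_000L, 1_000L};
        for (long fields : fieldCounts) {
            long mb = normBytes(fields, maxDoc) / (1024L * 1024L);
            System.out.println(fields + " fields: about " + mb + " MB of norms");
        }
    }
}
```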

So then I asked the $50,000,000 question:  "Is there any way to get rid of
this array for certain fields? ... or any way to get rid of it completely
for every field in a specific index?"

This may sound like a silly question for most IR applications where you
want length normalization to contribute to your scores, but in this
particular case most of these fields are only used to store a single numeric
value.  To be certain, there are some fields we have (or may add in the
future) that could benefit from having a norms[] ... but if it had to be
an all or nothing thing we could certainly live without them.

It seems to me that, in an ideal world, deciding whether or not you wanted
to store norms for a field would be like deciding whether you wanted to
store TermVectors for a field.  I can imagine a Field.isNormStored()
method ... but that seems like a pretty significant change to the existing
code base.


Alternately, I started wondering if it would be possible to write our own
IndexReader/IndexWriter subclasses that would ignore the norm info
completely (with maybe an optional list of field names the logic should be
limited to), and return nothing but fixed values for any parts of the code
base that wanted them.  Looking at SegmentReader and MultiReader this
looked very promising (especially considering the way SegmentReader uses a
system property to decide which actual class to use).  But I was less
enthusiastic when I started looking at IndexWriter and the DocumentWriter
classes .... there doesn't seem to be any clean way to subclass the
existing code base to eliminate the writing of the norms to the Directory
(curses those final classes, and private final methods).


So I'm curious what you guys think...

  1) Regarding the root problem: are there any other things you can think
     of besides norms[] that would contribute to the memory footprint
     needed by a large number of indexed fields?
  2) Can you think of a clean way for individual applications to eliminate
     norms (via subclassing the Lucene code base -- i.e. no patching)?
  3) Yonik is currently looking into what kind of patch it would take to
     optionally turn off norms (I'm not sure if he's looking at doing it
     "per field" or "per index").  Is that the kind of thing that would
     even be considered for getting committed?

--

-------------------------------------------------------------------
"Oh, you're a tricky one."                        Chris M Hostetter
     -- Trisha Weir                    hossman@rescomp.berkeley.edu


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: Eliminating norms ... completely

Posted by Chris Hostetter <ho...@fucit.org>.

:   2) Can you think of a clean way for individual applications to eliminate
:      norms (via subclassing the lucene code base - ie: no patching)

For completeness, I should mention that one thing I briefly considered was
writing a new Directory implementation that would proxy to FSDirectory,
but "fake" any files that had a .f### extension.

I'm not proud of this idea ... but it may be my last resort.
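
The first thing such a proxy would need is a way to recognize the norm files
by name.  A minimal sketch of just that test follows, assuming (as described
above) that the files in question end in ".f" followed by digits.  The class
and method names are hypothetical, and this is only the name check, not a
Directory implementation.

```java
import java.util.regex.Pattern;

public class NormFileNames {
    // Matches names ending in ".f" followed by one or more digits,
    // e.g. "_a.f12"; extensions such as ".fnm" or ".frq" do not match.
    private static final Pattern NORM_FILE = Pattern.compile(".*\\.f\\d+");

    public static boolean isNormFile(String name) {
        return NORM_FILE.matcher(name).matches();
    }

    public static void main(String[] args) {
        String[] names = {"_a.f12", "_a.fnm", "_a.frq", "segments"};
        for (String name : names) {
            System.out.println(name + " -> " + isNormFile(name));
        }
    }
}
```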


-Hoss



Re: Eliminating norms ... completely

Posted by jian chen <ch...@gmail.com>.

Hi, Chris,

Turning off norms looks like a very interesting problem to me. I remember
that in the Lucene road map for 2.0, there is a requirement to turn off indexing
for some information, such as proximity.

Maybe optionally turning off the norms could be an experiment to showcase
how to turn off proximity down the road.

Looking at the Lucene source code, it seems to me that the code could be
further improved, bringing it closer to good OO design. For example,
abstract classes could be changed to interfaces where possible, and accessor
methods like getXXX() could be used instead of public member variables.

My hunch is that the changes would add clarity of style to the code and
wouldn't be a real performance drawback.

Just my thoughts. For the sake of backward compatibility, these thoughts may not
be that valuable though.

Cheers,

Jian


Re: Eliminating norms ... completely

Posted by Chris Hostetter <ho...@fucit.org>.


Paul, thanks for your suggestions.  It seems like they mostly address the
issue of improving search time, by eliminating the need to read the norm
files from disk -- but the speed of the query isn't as big of a concern
for us as the memory footprint.

As I understand it, the point where we are really pushing the limits on
available memory comes during optimize -- particularly after user behavior
has resulted in deleting/re-adding a large percentage of the documents in
the index.

: For really large indexes the norms might become a bottleneck for
: when building them, but iirc this was improved recently.

Our production system runs 1.4.3 .... perhaps we should try some stress
tests with 1.9 and see what happens.


-Hoss



Re: Eliminating norms ... completely

Posted by Paul Elschot <pa...@xs4all.nl>.

On Friday 07 October 2005 23:18, Chris Hostetter wrote:
> 
...
> So I'm curious what you guys think...
> 
>   1) Regarding the root problem: is there any other things you can think
>      of besides norms[] that would contribute to the memory foot print
>      needed by a large number of indexed fields?

One could use further compression and/or memory mapped arrays,
but these are not as easy as ignoring the norms on disk altogether.

>   2) Can you think of a clean way for individual applications to eliminate
>      norms (via subclassing the lucene code base - ie: no patching)

See also this thread on auto-filters and boolean fields:
http://www.mail-archive.com/lucene-dev@jakarta.apache.org/msg08007.html

Treating a field as boolean would involve adding a TermScorer that ignores
the field norms. Everything that uses that could also be added, so, in
principle, no patching is needed. At some point, the knowledge about
which fields to treat as boolean would have to be inserted, but
I do not know a good place for that off the top of my head.

>   3) Yonik is currently looking into what kind of patch it would take to
>      optionally turn off norms (I'm not sure if he's looking at doing it
>      "per field" or "per index").  Is that the kind of thing that would
>      even be considered for getting commited?

For query searching the problem is the norms in RAM, not on disk.
Treating a field consistently as boolean would avoid reading the
norms from disk.

For really large indexes the norms might become a bottleneck
when building them, but iirc this was improved recently.

Regards,
Paul Elschot


Re: Eliminating norms ... completely

Posted by Chris Hostetter <ho...@fucit.org>.

: > Doesn't this cause a problem for highly interactive and large indexes? Since
: > every update to the index requires the rewriting of the norms, and
: > constructing a new array.
:
: The original complaint was primarily about search-time memory size, not
: update speed.  I like the proposed patch which addresses both.  I was
: simply noting that there is a way, without patching, to address the
: memory issue.

Sorry for the confusion ... We use a large number of indexed fields to
facilitate queries, but I was referring to the memory used by our
application on the whole -- which, as Robert guessed, involves lots of
continuous updates and periodic optimizes.  From what I can tell (or I
should say: from what I understand talking to Yonik) the place where the
norms hurt us is during the optimize calls -- particularly after some biz
process has caused a large number of documents to be added to the index.



-Hoss



Re: Eliminating norms ... completely

Posted by Doug Cutting <cu...@apache.org>.

Robert Engels wrote:
> Doesn't this cause a problem for highly interactive and large indexes? Since
> every update to the index requires the rewriting of the norms, and
> constructing a new array.

The original complaint was primarily about search-time memory size, not 
update speed.  I like the proposed patch which addresses both.  I was 
simply noting that there is a way, without patching, to address the 
memory issue.

> How expensive is maintaining the norms on disk, at least with regard to
> index merging?

Norm processing is simple & fast.  But with thousands of rarely used 
fields it could become a significant factor in update speed.  As with 
all performance questions, the answer requires benchmarking.

Doug


RE: Eliminating norms ... completely

Posted by Robert Engels <re...@ix.netcom.com>.

Doesn't this cause a problem for highly interactive and large indexes? Since
every update to the index requires the rewriting of the norms, and
constructing a new array.

How expensive is maintaining the norms on disk, at least with regard to
index merging?


Re: Eliminating norms ... completely

Posted by Doug Cutting <cu...@apache.org>.

Chris Hostetter wrote:
>   2) Can you think of a clean way for individual applications to eliminate
>      norms (via subclassing the lucene code base - ie: no patching)

Can't you simply subclass FilterIndexReader and override norms() to 
return a cached dummy array of Similarity.encodeNorm(1.0f) for those 
fields whose norms you don't want?  You'd still have to have a single 
array of bytes, but no longer one per field.
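
The shape of that idea can be sketched without Lucene on the classpath.  In
the sketch below, NormSource is a hypothetical stand-in for the reader being
wrapped, NoNormsWrapper plays the role of the FilterIndexReader subclass, and
defaultNorm stands for whatever Similarity.encodeNorm(1.0f) returns.  None of
these names come from Lucene itself, and a real version would need to cover
the reader's other norm-related methods as well.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class SharedDummyNorms {
    /** Hypothetical stand-in for the two reader methods this sketch needs. */
    public interface NormSource {
        byte[] norms(String field);
        int maxDoc();
    }

    /** Delegates to a wrapped source, except for fields whose norms are faked. */
    public static class NoNormsWrapper implements NormSource {
        private final NormSource in;
        private final Set<String> fakedFields;
        private final byte defaultNorm; // e.g. the encoded form of 1.0f
        private byte[] dummy;           // built once, shared by all faked fields

        public NoNormsWrapper(NormSource in, Set<String> fakedFields, byte defaultNorm) {
            this.in = in;
            this.fakedFields = new HashSet<>(fakedFields);
            this.defaultNorm = defaultNorm;
        }

        @Override
        public synchronized byte[] norms(String field) {
            if (!fakedFields.contains(field)) {
                return in.norms(field); // real norms for fields that want them
            }
            if (dummy == null) {
                dummy = new byte[in.maxDoc()];
                Arrays.fill(dummy, defaultNorm);
            }
            return dummy;
        }

        @Override
        public int maxDoc() {
            return in.maxDoc();
        }
    }
}
```

Every faked field gets the same array instance back, so the memory cost is
one byte per document in total rather than one byte per document per field.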

Doug


Re: Eliminating norms ... completely

Posted by Yonik Seeley <ys...@gmail.com>.

Here's my patch for the indexing and IndexReader side of things. Works like
a charm.

http://issues.apache.org/jira/browse/LUCENE-448

This doesn't yet include performance enhancements to scorers when norms
aren't included, but I first wanted to make sure things were completely
backward compatible with custom queries and scorers that we don't have
access to.

-Yonik
Now hiring -- http://tinyurl.com/7m67g


Re: Eliminating norms ... completely

Posted by Yonik Seeley <ys...@gmail.com>.

I'm approaching it the same as term vectors... make the norms optional on a
per field basis.
My first cut is returning a dummy norm array filled in with the equiv of
1.0f so I didn't have to go modify all the queries/weights/scorers that
retrieve the norms. For best performance, those scorers should be modified
to handle null.

-Yonik


RE: Eliminating norms ... completely

Posted by Robert Engels <re...@ix.netcom.com>.

I did exactly this in my custom Lucene, since the array of a byte per
document is extremely wasteful in a lot of applications. I just changed the
code to return null from getNorms() and modified the callers to treat a null
array as always 1 for any document.
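
On the caller side, the null convention might look something like the sketch
below.  It is a toy, not code from any Lucene version: normFactor, score and
the 256-entry decodeTable parameter are hypothetical names, with the table
standing in for whatever decodes a norm byte back to a float.

```java
public class NullSafeNorms {
    // decodeTable is a hypothetical 256-entry table mapping an encoded norm
    // byte to its float value; a null norms array means "no norms stored".
    public static float normFactor(byte[] norms, int doc, float[] decodeTable) {
        if (norms == null) {
            return 1.0f; // field has no norms: neutral factor for every document
        }
        return decodeTable[norms[doc] & 0xFF]; // mask so the byte indexes 0..255
    }

    // Toy score: a raw term score scaled by the per-document norm factor.
    public static float score(float rawScore, byte[] norms, int doc, float[] decodeTable) {
        return rawScore * normFactor(norms, doc, decodeTable);
    }
}
```

Compared with handing back a dummy array, this avoids allocating anything for
fields without norms, at the price of touching every caller that reads them.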
