Posted to dev@lucene.apache.org by "Grant Ingersoll (JIRA)" <ji...@apache.org> on 2011/01/11 17:55:51 UTC

[jira] Commented: (LUCENE-830) norms file can become unexpectedly enormous

    [ https://issues.apache.org/jira/browse/LUCENE-830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12980180#action_12980180 ] 

Grant Ingersoll commented on LUCENE-830:
----------------------------------------

Continuing a thread from IRC...

In Mahout, we have one dense vector representation along with a few sparse representations.  In our case, we make users pick up front which representation they want, based on what their data looks like and which algorithms they are running.  The dense vector is essentially just an array of the underlying primitive, while the sparse ones are optimized for either random access or sequential access.  In Lucene's case, we could probably pick an appropriate representation automatically at IndexReader creation time by tracking the density of norms for a given field.
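To make that concrete, here's a rough sketch (all names are hypothetical, not Mahout's or Lucene's actual APIs) of choosing a dense array or a sparse map for a field's norms based on how many docs in the segment actually have that field:

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

interface NormsSource {
  byte norm(int docId);
}

final class DenseNorms implements NormsSource {
  private final byte[] norms;             // one byte per document
  DenseNorms(byte[] norms) { this.norms = norms; }
  public byte norm(int docId) { return norms[docId]; }
}

final class SparseNorms implements NormsSource {
  private final Map<Integer, Byte> norms; // only docs that have the field
  private final byte defaultNorm;         // returned for docs without it
  SparseNorms(Map<Integer, Byte> norms, byte defaultNorm) {
    this.norms = norms;
    this.defaultNorm = defaultNorm;
  }
  public byte norm(int docId) {
    Byte b = norms.get(docId);
    return b != null ? b : defaultNorm;
  }
}

final class NormsFactory {
  // Below ~10% density, assume the sparse form wins.  The threshold is a
  // guess and would need tuning against the map's per-entry overhead.
  private static final double DENSITY_THRESHOLD = 0.1;

  static NormsSource create(Map<Integer, Byte> fieldNorms, int maxDoc, byte defaultNorm) {
    double density = (double) fieldNorms.size() / maxDoc;
    if (density >= DENSITY_THRESHOLD) {
      byte[] dense = new byte[maxDoc];
      Arrays.fill(dense, defaultNorm);
      for (Map.Entry<Integer, Byte> e : fieldNorms.entrySet()) {
        dense[e.getKey()] = e.getValue();
      }
      return new DenseNorms(dense);
    }
    return new SparseNorms(new HashMap<>(fieldNorms), defaultNorm);
  }
}

The random-access vs. sequential-access distinction matters too: a hash map is fine for random lookups during scoring, but a sorted parallel-array form would be better if norms are ever scanned in doc order.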

The other thing to consider is that we may want to let people separate boosting from length normalization, with each one independently switchable on or off.
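A minimal sketch of what that could look like, again with invented names, assuming the classic 1/sqrt(numTerms) length factor:

final class NormOptions {
  final boolean storeBoost;       // index-time field boost
  final boolean storeLengthNorm;  // 1/sqrt(numTerms) length factor

  NormOptions(boolean storeBoost, boolean storeLengthNorm) {
    this.storeBoost = storeBoost;
    this.storeLengthNorm = storeLengthNorm;
  }

  // Fold in only the enabled parts; if both are off, no norm byte is
  // needed for the field at all.
  float norm(float boost, int numTerms) {
    float n = 1.0f;
    if (storeBoost) n *= boost;
    if (storeLengthNorm) n *= (float) (1.0 / Math.sqrt(numTerms));
    return n;
  }
}

Turning both off for a field is effectively what omitNorms gives you today; the interesting new cases are the two middle settings.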

> norms file can become unexpectedly enormous
> -------------------------------------------
>
>                 Key: LUCENE-830
>                 URL: https://issues.apache.org/jira/browse/LUCENE-830
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>            Priority: Minor
>
> Spinoff from this user thread:
>    http://www.gossamer-threads.com/lists/lucene/java-user/46754
> Norms are not stored sparsely, so even if a doc doesn't have field X
> we still use up 1 byte in the norms file (and in memory when that
> field is searched) for that segment.  I think this is done for
> performance at search time?
> For indexes with a large # of documents where each document can have
> wildly varying fields, each segment's norms take # documents times
> # fields seen in that segment.  When optimize merges all segments, the
> single merged segment takes total # documents times # distinct fields
> across all segments, so its norms file can require far more storage
> than the sum of all previous segments' norms files.
> I think it's uncommon to have a huge number of distinct fields (?) so
> we would need a solution that doesn't hurt the more common case where
> most documents have the same fields.  Maybe something analogous to how
> bitvectors are now optionally stored sparsely?
> One simple workaround is to disable norms.
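As a back-of-the-envelope illustration of the blowup described above (all numbers invented):

public class NormsBlowup {
  public static void main(String[] args) {
    long docsPerSegment = 1_000_000L;
    int segments = 10;
    int fieldsPerSegment = 100;
    int totalDistinctFields = 1_000;  // union of fields across all segments

    long perSegment = docsPerSegment * fieldsPerSegment;  // 100 MB each
    long beforeMerge = perSegment * segments;             // 1 GB total
    long afterOptimize = docsPerSegment * segments * totalDistinctFields;  // 10 GB

    System.out.printf("before merge: %,d bytes; after optimize: %,d bytes%n",
        beforeMerge, afterOptimize);
  }
}

In that scenario the merged norms file is 10X the sum of the per-segment files, purely because every doc now pays one byte for every field in the union.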

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org