Posted to dev@lucene.apache.org by "Michael McCandless (JIRA)" <ji...@apache.org> on 2007/03/13 15:14:09 UTC

[jira] Created: (LUCENE-830) norms file can become unexpectedly enormous

norms file can become unexpectedly enormous
-------------------------------------------

                 Key: LUCENE-830
                 URL: https://issues.apache.org/jira/browse/LUCENE-830
             Project: Lucene - Java
          Issue Type: Bug
          Components: Index
    Affects Versions: 2.1
            Reporter: Michael McCandless
            Priority: Minor



Spinoff from this user thread:

   http://www.gossamer-threads.com/lists/lucene/java-user/46754

Norms are not stored sparsely, so even if a doc doesn't have field X
we still use up 1 byte in the norms file (and in memory when that
field is searched) for that segment.  I think this is done for
performance at search time?

For indexes that have a large # documents where each document can have
wildly varying fields, each segment's norms file will use (# documents) x
(# fields seen in that segment) bytes.  When optimize merges all segments,
the merged segment sees every document and every field, so that product
grows multiplicatively and the norms file for the single segment can
require far more storage than the sum of all previous segments' norms
files.
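A back-of-the-envelope sketch makes the growth concrete (the segment, document, and field counts below are invented for illustration, assuming the worst case where the segments' field sets are disjoint):

```java
// Norms use 1 byte per (document, field-with-norms) pair in a segment,
// whether or not the document actually has that field.
public class NormsSizeSketch {
    static long normsBytes(long docs, long fieldsSeen) {
        return docs * fieldsSeen; // 1 byte per doc per field seen in the segment
    }

    public static void main(String[] args) {
        long docsPerSegment = 1_000_000;
        long fieldsPerSegment = 10;
        long segments = 3;

        // Before optimize: each segment only pays for the fields it has seen.
        long before = segments * normsBytes(docsPerSegment, fieldsPerSegment);

        // After optimize, worst case (disjoint field sets): the single merged
        // segment sees every field across every document.
        long after = normsBytes(segments * docsPerSegment,
                                segments * fieldsPerSegment);

        System.out.println(before); // 30000000 (~30 MB)
        System.out.println(after);  // 90000000 (~90 MB), 3x the sum
    }
}
```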

I think it's uncommon to have a huge number of distinct fields (?) so
we would need a solution that doesn't hurt the more common case where
most documents have the same fields.  Maybe something analogous to how
bitvectors are now optionally stored sparsely?

One simple workaround is to disable norms.
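Against the Lucene 2.1 API that could look like the following sketch (field names and values are illustrative, not from the thread); norms can be omitted per field either at construction time with Field.Index.NO_NORMS or after construction via Fieldable's setOmitNorms():

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

// Sketch: omit norms for fields that don't need length normalization
// or index-time boosts. Field names ("id", "body") are illustrative.
Document doc = new Document();

// Untokenized field indexed without norms:
doc.add(new Field("id", "doc-42", Field.Store.YES, Field.Index.NO_NORMS));

// Tokenized field with norms omitted after construction:
Field body = new Field("body", "some text", Field.Store.NO,
                       Field.Index.TOKENIZED);
body.setOmitNorms(true);
doc.add(body);
```

Note that omitting norms disables length normalization and index-time field boosts for that field, so it only fits fields where those don't matter for scoring.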


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-830) norms file can become unexpectedly enormous

Posted by "Doron Cohen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12480520 ] 

Doron Cohen commented on LUCENE-830:
------------------------------------

> One simple workaround is to disable norms. 

You mean for some of the fields, using Fieldable's setOmitNorms().

For large indexes, I would think that most fields would be indexed with omitNorms=true, except for one (content) or two (subject?) fields where length normalization and/or boosting are of importance. In such cases there would not really be a problem.

Consider the example of an index created to add textual search to a database application by mapping the index field names to the names of the database's "textual columns"; if more than one table is indexed but the textual column names happen to differ between the tables, then yes - with that straightforward mapping there would be a waste: lots of unused norm bytes.

One workaround for such applications could be to map the textual columns of all tables to a single textual field in Lucene, though then they would have to filter by a table-name field (which they might do anyhow). 
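A sketch of that mapping against the Lucene 2.1 API (the field and table names are invented for illustration): every textual column feeds one shared "text" field, and an untokenized "table" field records the source table for filtering:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

// Sketch: instead of one Lucene field per table-specific column name,
// funnel every textual column into a single "text" field and record the
// source table in a filterable keyword field. Names are illustrative.
Document indexRow(String tableName, String textualColumnValue) {
    Document doc = new Document();
    doc.add(new Field("text", textualColumnValue,
                      Field.Store.NO, Field.Index.TOKENIZED));
    doc.add(new Field("table", tableName,
                      Field.Store.YES, Field.Index.UN_TOKENIZED));
    return doc;
}
```

Searches would then constrain results with a TermQuery or filter on the "table" field, which such an application might already be doing anyway.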




[jira] Commented: (LUCENE-830) norms file can become unexpectedly enormous

Posted by "Doron Cohen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12480522 ] 

Doron Cohen commented on LUCENE-830:
------------------------------------

> You mean for some of the fields, using Fieldable's setOmitNorms(). 

Oops, just noticed this was already suggested in that user thread...

Anyhow, for that specific scenario it seems omitNorms would be sufficient, but it won't help the db-based example above.
