Posted to solr-user@lucene.apache.org by Ivan Hrytsyuk <ih...@softserveinc.com> on 2011/11/10 13:42:00 UTC

[Solr-3.4] Norms file size is large in case of many unique indexed fields in index

Hello everyone,

Our index becomes very large when norms are enabled.

schema.xml:

type declaration:
<fieldType name="simpleTokenizer" class="solr.TextField"
positionIncrementGap="100" omitNorms="false">
     <analyzer>
         <tokenizer class="solr.KeywordTokenizerFactory" />
     </analyzer>
</fieldType>

fields declaration:
<field name="id" stored="true" indexed="true" required="true"
type="string" />
<field name="name" stored="true" indexed="true" type="string" />
<dynamicField name="unique_*" stored="false" indexed="true"
type="simpleTokenizer" multiValued="false" />

For 5000 documents (each document has 2 unique fields, so 2*5000=10000
unique fields in the index), the index size is 48.24 MB.
But if we omit norms (omitNorms="true"), the index size is 0.56 MB.

Next, if we increase the number of unique fields per document to 3
(3*5000=15000 unique fields in the index), we get 72.23 MB and 0.70 MB
respectively.
And if we increase the number of documents to 10000 (3*10000=30000
unique fields in the index), we get 287.54 MB and 1.44 MB respectively.
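
A rough check of the sizes reported above, assuming that Lucene 3.x stores
one norm byte per document for every indexed field that has norms enabled,
including documents that do not contain that field. That storage model is
an assumption here, not something confirmed in this thread. The sketch
also assumes that "MB" means 2**20 bytes and that the omitNorms="true"
size is a fair baseline for everything except norms.

```python
# Rough estimate of norms overhead, assuming Lucene 3.x keeps one norm byte
# per document for every indexed field with norms (an assumption, see above).

MIB = 1024 * 1024  # assumes the reported "MB" means 2**20 bytes


def estimated_norms_mb(num_docs, fields_per_doc):
    """Estimated norms size in MB when every document adds its own unique fields."""
    total_fields = num_docs * fields_per_doc  # distinct field names in the index
    norms_bytes = total_fields * num_docs     # one byte per (field, document) pair
    return norms_bytes / MIB


# (documents, unique fields per document, MB with norms, MB without norms)
reported = [
    (5000, 2, 48.24, 0.56),
    (5000, 3, 72.23, 0.70),
    (10000, 3, 287.54, 1.44),
]

for num_docs, fields_per_doc, with_norms, without_norms in reported:
    estimate = estimated_norms_mb(num_docs, fields_per_doc)
    observed = with_norms - without_norms
    print(f"{num_docs} docs x {fields_per_doc} fields: "
          f"estimated {estimate:.2f} MB, observed difference {observed:.2f} MB")
```

Under that assumption the norms grow with the product of field count and
document count, which would explain why doubling the number of documents
(and with it the number of unique fields) roughly quadruples the index.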

We've prepared a test application that reproduces this behavior. It can
be downloaded here:
https://bitbucket.org/coldserenity/solr-large-index-with-norms

Could anyone tell us whether this index size is expected in the cases
above? And if it is, what configuration can be applied to reduce the
size of the index?

Thank you in advance, Ivan

Re: [Solr-3.4] Norms file size is large in case of many unique indexed fields in index

Posted by Robert Muir <rc...@gmail.com>.
What is the point of a unique indexed field?

If for all of your fields, there is only one possible document, you
don't need length normalization, scoring, or a search engine at all...
just use a HashMap?

-- 
lucidimagination.com

RE: [Solr-3.4] Norms file size is large in case of many unique indexed fields in index

Posted by Ivan Hrytsyuk <ih...@softserveinc.com>.
Thank you for the responses.

Some background on the task:
The problem we are trying to solve with Solr is the following. 
We have to provide full-text search over documents that consist partly of fields that are always present and partly of additional metadata stored as key-value pairs whose keys are not known beforehand. We still need to be able to search the content of that additional metadata.

Because we have to provide FTS capabilities, we used Solr and not a HashMap or some BigTable.
To address the optional nature of the additional metadata fields and their searchability, we decided to use Solr indexed dynamic fields.

Questions:
1. Yonik, will your approach work for us with the following data:
doc1
  uniqueFields:["100=boo foo roo","101=bar bar 100 boo"]
doc2
  uniqueFields:["101=boo roo","102=bar foo 101 boo"]
where we want to fetch documents that contain the value 'foo' in the metadata with field key 100 (that is, only doc1 should be returned)?

2. Should I file a JIRA issue about the large index size, or is this expected behaviour in our case?
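
Regarding question 1: if whole values such as "100=boo foo roo" are kept
as single tokens, a search for 'foo' under key 100 has no exact term to
match. One possible variation, which is an assumption and not something
proposed in this thread, is to prefix every word with its key at index
time so that "100=foo" becomes a term of its own. The sketch below only
simulates that matching in plain Python; it does not call Solr, and
key_prefixed_terms and matches are hypothetical helper names.

```python
def key_prefixed_terms(values):
    """Split each "key=word word ..." value into a set of "key=word" terms."""
    terms = set()
    for value in values:
        key, _, text = value.partition("=")  # split on the first "=" only
        for word in text.split():
            terms.add(f"{key}={word}")
    return terms


def matches(doc_values, key, word):
    """True if the document has `word` among the values stored under `key`."""
    return f"{key}={word}" in key_prefixed_terms(doc_values)


# The two documents from question 1.
docs = {
    "doc1": ["100=boo foo roo", "101=bar bar 100 boo"],
    "doc2": ["101=boo roo", "102=bar foo 101 boo"],
}

# Look for 'foo' under key 100.
hits = [doc_id for doc_id, values in docs.items() if matches(values, "100", "foo")]
print(hits)
```

With this encoding, doc2 is not returned because its 'foo' is stored
under key 102, so the only matching document is doc1.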

Thanks, Ivan
 


________________________________________
From: yseeley@gmail.com [yseeley@gmail.com] On Behalf Of Yonik Seeley [yonik@lucidimagination.com]
Sent: Thursday, November 10, 2011 10:22 PM
To: solr-user@lucene.apache.org
Subject: Re: [Solr-3.4] Norms file size is large in case of many unique indexed fields in index

On Thu, Nov 10, 2011 at 7:42 AM, Ivan Hrytsyuk
<ih...@softserveinc.com> wrote:
> For 5000 documents (every document has 2 unique fields, 2*5000=10000
> unique fields in index), index size is 48.24 MB.

You might be able to turn this around and encode the "unique field"
information in a multi-valued field:

For example, instead of
  myUniqueField100:"foo"  myUniqueField101:"bar"
you could do
  uniqueFields:["100=foo","101=bar"]

The exact details depend on how you are going to use/query these
fields of course.

-Yonik
http://www.lucidimagination.com

Re: [Solr-3.4] Norms file size is large in case of many unique indexed fields in index

Posted by Yonik Seeley <yo...@lucidimagination.com>.
On Thu, Nov 10, 2011 at 7:42 AM, Ivan Hrytsyuk
<ih...@softserveinc.com> wrote:
> For 5000 documents (every document has 2 unique fields, 2*5000=10000
> unique fields in index), index size is 48.24 MB.

You might be able to turn this around and encode the "unique field"
information in a multi-valued field:

For example, instead of
  myUniqueField100:"foo"  myUniqueField101:"bar"
you could do
  uniqueFields:["100=foo","101=bar"]

The exact details depend on how you are going to use/query these
fields of course.

-Yonik
http://www.lucidimagination.com
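
The re-encoding suggested above can be sketched as a small step applied
to each document before it is sent to Solr. This is a minimal sketch in
plain Python, not a Solr API: encode_unique_fields is a hypothetical
helper, and the "myUniqueField" prefix and "uniqueFields" target name are
taken from the example above.

```python
def encode_unique_fields(doc, prefix="myUniqueField", target="uniqueFields"):
    """Fold per-key fields such as myUniqueField100 into one multi-valued field."""
    encoded = {}
    pairs = []
    for name, value in doc.items():
        if name.startswith(prefix):
            key = name[len(prefix):]       # "myUniqueField100" -> "100"
            pairs.append(f"{key}={value}")
        else:
            encoded[name] = value          # regular fields pass through unchanged
    if pairs:
        encoded[target] = pairs
    return encoded


doc = {
    "id": "1",
    "name": "example",
    "myUniqueField100": "foo",
    "myUniqueField101": "bar",
}
print(encode_unique_fields(doc))
```

Because every document then uses the same field name, the number of
distinct indexed fields no longer grows with the number of keys, which
should keep the norms overhead from growing with it.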