Posted to solr-user@lucene.apache.org by bo_b <bo...@staff.jubii.dk> on 2006/08/08 10:46:59 UTC

Optimizing a schema

Hello, 

I have tried indexing a vbulletin message board, containing roughly 7
million posts.

My schema is as follows:

   <field name="postid" type="int" indexed="true" stored="true" />
   <field name="threadid" type="int" indexed="false" stored="true" />
   <field name="username" type="string" indexed="false" stored="true" />
   <field name="title" type="string" indexed="false" stored="true" />
   <field name="teaser" type="string" indexed="false" stored="true" />
   <field name="date" type="date" indexed="true" stored="true"
omitNorms="true"/>
   <field name="blob" type="text" indexed="true" stored="false"
multiValued="true" omitNorms="true"/>

 <uniqueKey>postid</uniqueKey>

   <copyField source="username" dest="blob"/>
   <copyField source="title" dest="blob"/>

I am trying to figure out whether there is anything I can do to lower the disk
usage and/or increase sorting speed before we go live with the search. So a
few questions came to mind:

1) I was planning to sort on the date field (i.e. add "; date desc").
But I was wondering if it would be more efficient to sort on postid
instead (since a higher postid in vBulletin means a newer post). I already have
indexed=true for postid since it's our unique field, but then I could set
indexed=false for date, and perhaps save some storage space?

2) If we sort on postid instead, would we need to use the integer or the sint
type? I assume sint would be faster (?) but perhaps use more storage?

3) About omitNorms=true, I must admit I don't exactly understand what it does
:) But I read that it would save 1 byte per document. Are there any other
fields I need to add it to in my schema? As far as I understand,
omitNorms=true only makes a difference for indexed=true fields, and doesn't
do anything for int fields?

Thanks in advance for any suggestions :)

/Bo
-- 
View this message in context: http://www.nabble.com/Optimizing-a-schema-tf2071403.html#a5702635
Sent from the Solr - User forum at Nabble.com.

Re: Optimizing a schema

Posted by bo_b <bo...@staff.jubii.dk>.


Yonik Seeley wrote:
> 
> No, they will be roughly the same speed.
> What you *could* try to do is always *index* documents in postid/date
> order... then sorting would not require any FieldCache entry.  It
> would require a minor change to Solr (allow sorting on lucene internal
> docid, which matches the order that documents are added to an index).
> 
OK, I will look into that; it would be nice to avoid the delay when building
FieldCache entries after a commit.


Yonik Seeley wrote:
> 
> If you need range queries, SortableIntField  values are ordered
> correctly for them to work.
> For sorting, both int and sint fields work... the difference is in how
> the FieldCache entry is built.
>   For IntField, an Integer.parseInt(str) needs to be done for each
> distinct str.
>   SortableIntField is sorted like strings... the ordinal (order in the
> index) is recorded for each distinct value.
> 
>   So sint will build the FieldCache faster, but the string values will
> cause the entry to be larger.  After the FieldCache entry is built,
> both int and sint should be comparable in speed.
> 

I will test it and see what works best. I think we would prefer being
able to build the FieldCache entries faster.

Thanks for all the helpful explanations/hints :)

Re: Optimizing a schema

Posted by Yonik Seeley <yo...@apache.org>.

On 8/8/06, bo_b <bo...@staff.jubii.dk> wrote:
> I have tried indexing a vbulletin message board, containing roughly 7
> million posts.
>
> My schema is as follows:
>
>    <field name="postid" type="int" indexed="true" stored="true" />
>    <field name="threadid" type="int" indexed="false" stored="true" />
>    <field name="username" type="string" indexed="false" stored="true" />
>    <field name="title" type="string" indexed="false" stored="true" />
>    <field name="teaser" type="string" indexed="false" stored="true" />
>    <field name="date" type="date" indexed="true" stored="true"
> omitNorms="true"/>
>    <field name="blob" type="text" indexed="true" stored="false"
> multiValued="true" omitNorms="true"/>
>
>  <uniqueKey>postid</uniqueKey>
>
>    <copyField source="username" dest="blob"/>
>    <copyField source="title" dest="blob"/>
>
> I am trying to figure out whether there is anything I can do to lower the disk
> usage and/or increase sorting speed before we go live with the search. So a
> few questions came to mind:
>
> 1) I was planning to sort on the date field (i.e. add "; date desc").
> But I was wondering if it would be more efficient to sort on postid
> instead (since a higher postid in vBulletin means a newer post).

No, they will be roughly the same speed.
What you *could* try to do is always *index* documents in postid/date
order... then sorting would not require any FieldCache entry.  It
would require a minor change to Solr (allow sorting on lucene internal
docid, which matches the order that documents are added to an index).

> 2) If we sort on postid instead, would we need to use the integer or the sint
> type? I assume sint would be faster (?) but perhaps use more storage?

If you need range queries, SortableIntField  values are ordered
correctly for them to work.
For sorting, both int and sint fields work... the difference is in how
the FieldCache entry is built.
  For IntField, an Integer.parseInt(str) needs to be done for each distinct str.
  SortableIntField is sorted like strings... the ordinal (order in the
index) is recorded for each distinct value.

  So sint will build the FieldCache faster, but the string values will
cause the entry to be larger.  After the FieldCache entry is built,
both int and sint should be comparable in speed.
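
As a rough illustration of the int/sint choice above, a schema fragment
might look like the following. This is only a sketch: the fieldtype names
"integer" and "sint" are assumptions (the schema quoted above calls its
integer type "int"), so adjust them to whatever your own schema declares.
Only the IntField and SortableIntField classes are the ones discussed here.

```xml
<!-- Sketch only: fieldtype names are assumptions, adjust to your schema. -->
<types>
  <!-- Plain integer: each distinct value must be parsed when the FieldCache entry is built -->
  <fieldtype name="integer" class="solr.IntField"/>
  <!-- Sortable integer: values are ordered like strings, so range queries work -->
  <fieldtype name="sint" class="solr.SortableIntField"/>
</types>

<fields>
  <!-- postid declared as sint so it can be used for sorting and range queries -->
  <field name="postid" type="sint" indexed="true" stored="true"/>
</fields>
```

Switching an existing field from one type to the other changes how its
values are encoded in the index, so the documents would need to be
reindexed after such a change.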

> 3) About omitNorms=true, I must admit I don't exactly understand what it does
> :) But I read that it would save 1 byte per document.

One byte per document for that indexed field, regardless of whether the
field exists for all documents or not.  You lose length normalization
(an increase in score for matches on shorter fields... not needed if
it's not a full-text field anyway), and you lose index-time boosts
(which it doesn't look like you are using).

Since "blob" looks like the body of the post, I think you probably
*want* norms to get the length normalization factors.  Probably all
other indexed fields can have omitNorms="true" (including postid).

> Are there any other
> fields I need to add it to in my schema? As far as I understand,
> omitNorms=true only makes a difference for indexed=true fields, and doesn't
> do anything for int fields?

omitNorms=true will omit norms for *any* indexed field, including int
fields.  Deep inside Lucene, all indexed fields are string fields.
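
Applied to the schema from the original post, the omitNorms advice above
could be sketched roughly as follows. This is an illustration rather than
a tested configuration: it assumes "blob" really is meant to behave as the
full-text field, and it leaves the non-indexed fields untouched since norms
only apply to indexed fields.

```xml
<!-- Sketch only: the original schema with the omitNorms suggestions applied. -->
<fields>
  <!-- Indexed but not full-text: norms are not needed -->
  <field name="postid" type="int" indexed="true" stored="true" omitNorms="true"/>
  <field name="date" type="date" indexed="true" stored="true" omitNorms="true"/>

  <!-- Not indexed: omitNorms is left off here -->
  <field name="threadid" type="int" indexed="false" stored="true"/>
  <field name="username" type="string" indexed="false" stored="true"/>
  <field name="title" type="string" indexed="false" stored="true"/>
  <field name="teaser" type="string" indexed="false" stored="true"/>

  <!-- Full-text field: norms kept so length normalization still applies -->
  <field name="blob" type="text" indexed="true" stored="false" multiValued="true"/>
</fields>
```

Keeping norms on "blob" costs one byte per document for that field, which
is the trade-off for retaining length normalization in scoring.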

-Yonik