Posted to solr-user@lucene.apache.org by David Neubert <de...@yahoo.com> on 2007/11/10 18:53:54 UTC

Redundant indexing * 4 only solution (for par/sen and case sensitivity)

Hi all,

Using SOLR, I believe I have to index the same content 4 times (not desirable) into 2 indexes -- and I don't know how to practically do multiple indexes in SOLR (if indeed there is no better solution than 4 indexing runs into two indexes).

My need is case-sensitive and case-insensitive searches over well-formed XML content (books), with exact searches at the paragraph and sentence levels -- no errors from approximate boundaries, since the source content has exact par/sen tags.

I have already proven a pretty nice solution for par/sen by indexing twice into the same index in SOLR.  I have added a tags field, and I put the correlative XML tags (comma delimited) into this field; one of them is either a para or a sen flag, which marks the (partial) document as a paragraph or a sentence.  Thus each paragraph of the book is indexed as a single document (with its sentences concatenated), and then each sentence in the book is indexed again as a single document.  Both go into the same SOLR index. I just add an AND "tags:para" or "tags:sen" to my search and everything works fine.
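
To make the layout concrete, here is a rough sketch of what one paragraph and one of its sentences might look like in SOLR's XML update format. Only the tags field is as described above; the id and text field names, the id values, and the extra "chapter" tag are placeholders for illustration, not my actual schema:

  <add>
    <!-- a paragraph indexed as a single document (placeholder values) -->
    <doc>
      <field name="id">book1-p12</field>
      <field name="tags">para,chapter</field>
      <field name="text">First sentence. Second sentence.</field>
    </doc>
    <!-- a sentence from that paragraph, indexed again on its own -->
    <doc>
      <field name="id">book1-p12-s1</field>
      <field name="tags">sen,chapter</field>
      <field name="text">First sentence.</field>
    </doc>
  </add>

With documents shaped like this, a sentence-level search would be something like text:first AND tags:sen, assuming the tags field is analyzed so that the comma-delimited values become separate terms.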

The obvious downside to this approach is the 2X indexing, but it does execute quite nicely on a single index using SOLR. This obviously doesn't scale nicely, but it will probably do for quite a while.

I thought I could live with that....

But then I moved on to case sensitive and case-insensitive searches, and my research so far is pointing to one index for each case.

So now I have:
(1) 4X in content indexing
(2) 2X in actual SOLR/Lucene indices
(3) I don't know how to practically do multiple indices using SOLR?

If there is a better way of attacking this problem, I would appreciate recommendations!!!

Also, I don't know how to do multiple indices in SOLR -- I have heard it might be available in 1.3.0?  If this is my only recourse, please advise me where really good documentation is available on building 1.3.0.  I am not admin savvy, but I did succeed in getting SOLR up myself and navigating through it with the help of this forum.  But I have heard that building 1.3.0 (as opposed to downloading and installing it, like in 1.2.0) is a whole different experience and much more complex.

Thanks

Dave






Re: Redundant indexing * 4 only solution (for par/sen and case sensitivity)

Posted by Ryan McKinley <ry...@gmail.com>.
> So now I have:
> (1) 4X in content indexing
> (2) 2X in actual SOLR/Lucene indices
> (3) I don't know how to practically do multiple indices using SOLR?
> 
> If there is a better way of attacking this problem, I would appreciate recommendations!!!
> 

I don't quite follow your current approach, but it sounds like you just 
need some copyFields to index the same content with multiple analyzers.

for example, say you have fields:

  <field name="content" type="string" indexed="true" stored="true"/>
  <field name="content_sentence" type="sentence" indexed="true" stored="false"/>
  <field name="content_paragraph" type="paragraph" indexed="true" stored="false"/>
  <field name="content_text" type="text" indexed="true" stored="false"/>

and copy fields:

   <copyField source="content" dest="content_sentence"/>
   <copyField source="content" dest="content_paragraph"/>
   <copyField source="content" dest="content_text"/>
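
The same pattern could cover case sensitivity. Here is a minimal sketch, assuming two made-up field types (text_cs, text_ci) and two made-up fields (content_cs, content_ci); none of these names exist in the stock schema, and the whitespace tokenizer is just a placeholder choice. The only difference between the two types is the lowercase filter:

  <!-- hypothetical case-sensitive type: tokens keep their original case -->
  <fieldType name="text_cs" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    </analyzer>
  </fieldType>

  <!-- hypothetical case-insensitive type: tokens are lowercased -->
  <fieldType name="text_ci" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

  <field name="content_cs" type="text_cs" indexed="true" stored="false"/>
  <field name="content_ci" type="text_ci" indexed="true" stored="false"/>

  <copyField source="content" dest="content_cs"/>
  <copyField source="content" dest="content_ci"/>

With something like this, a case-sensitive search would go against content_cs and a case-insensitive one against content_ci, all within one index.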


The 4X indexing cost?  If you *need* to index the content 4 different 
ways, you don't have any way around that - do you?  But is it really a 
big deal?  How often does it need to index?  How big is the data?

I'm not quite following your need for multiple solr indices, but in 1.3 
it is possible.

ryan