Posted to solr-user@lucene.apache.org by Yogesh Chawla - PD <pr...@yahoo.com> on 2009/01/20 23:29:35 UTC

New to Solr/Lucene design question

Hello All,
We are using SOLR/Lucene as the search engine for an application
we are designing.  The application is a workflow application that can
receive different types of documents.

For example, we are currently working on getting booking documents but
will also accept arrest documents later this year.

We have defined a custom schema that incorporates some schemas designed
by federal consortiums.  From those schemas we pluck out values that we want 
SOLR/Lucene to index and search on and we go from our instance document to
a SOLR document.

The fields in our schema.xml look like this:

 <fields>
    <!--   record-uri, unique identifier for any type of record  -->
   <field name="record-uri" type="string" indexed="true" stored="true" required="true" /> 
   <!--   stash-filepath, path to the entire XML document on the file system -->
   <field name="stash-filepath" type="string" indexed="true" stored="true" required="true" />
   <!--   stash-content THIS IS THE FIELD I HAVE QUESTIONS ABOUT-->
   <field name="stash-content" type="string" indexed="true" stored="true" termVectors="true" multiValued="true" omitNorms="true"/>
</fields>

Above, there is a field called "stash-content".  The goal is to take any searchable data from
any document type and put it in this field.  For example, we would store data like this in XML format:


<add>
  <doc>
    <field name="stash-content">arrestee_firstname_Yogesh</field>
    <field name="stash-content">arrestee_lastname_Chawla</field>
    <field name="stash-content">arrestee_middlename_myMiddleName</field>
  </doc>
</add>
The advantage of this approach is that we can add new document types to search on and, as long
as they use the same semantics (such as arrestee_firstname), we won't have to update any code.
It also keeps the code simple and generic for any document type.

We can run a starts-with query on first name like this: arrestee_firstname_Y*.  We had to use
_ instead of a space so that, when a query is performed, the value is searched as a single
string rather than as separate words (hope that makes sense).

The downside could be a performance hit.

The other approach is to add fields explicitly like this:

<add>
  <doc>
    <field name="arrestee_firstname">Yogesh</field>
    <field name="arrestee_lastname">Chawla</field>
    <field name="arrestee_middlename">myMiddleName</field>
  </doc>
</add>
This approach seems more traditional.  The pro is that it is straightforward.  The con is that every time
we add a new document type to search on, we have to update schema.xml and the Java code that creates SOLR
documents.
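
For comparison, here is a rough sketch of what the schema.xml side of this second approach might look
like.  The type and attribute values are assumptions copied from the field definitions above, not a
tested configuration:

```xml
<fields>
  <!-- one explicit declaration per searchable value (attribute values are assumed) -->
  <field name="arrestee_firstname" type="string" indexed="true" stored="true" />
  <field name="arrestee_lastname" type="string" indexed="true" stored="true" />
  <field name="arrestee_middlename" type="string" indexed="true" stored="true" />
  <!-- each new document type would add its own entries here -->
</fields>
```

With fields declared this way, a starts-with query on first name would take the form arrestee_firstname:Y*.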

The number of documents that we will eventually want to search on is about 5 million.  However, this will take a while
to ramp up to and we are more immediately looking at searching on about 100,000.

I am new to SOLR and just inherited this project with approach number 1.  Is this something that is going to bite us in the
future?

Thanks,
Yogesh

RE: New to Solr/Lucene design question

Posted by "Feak, Todd" <To...@smss.sony.com>.
A third option - Use dynamic fields.

Add a dynamic field called "*_stash". This will allow new fields for
documents to be added down the road without changing schema.xml, yet
still allow you to query on fields like "arresteeFirstName_stash"
without extra overhead.
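
As a rough sketch, the schema.xml entry might look like the following. The
type and attribute values are assumptions borrowed from the original
stash-content field and should be adjusted to fit your data:

```xml
<fields>
  <!-- any field whose name ends in _stash matches this declaration -->
  <!-- type/indexed/stored values are assumed, not prescribed -->
  <dynamicField name="*_stash" type="string" indexed="true" stored="true" />
</fields>
```

A document could then be posted with a field such as
<field name="arresteeFirstName_stash">Yogesh</field>, and a starts-with
query would take the form arresteeFirstName_stash:Y*. Keep in mind that
a "string" field is not analyzed, so matching is case-sensitive.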

-Todd Feak
