Posted to java-user@lucene.apache.org by Stephen Greene <SG...@metalseconomics.com> on 2009/09/08 13:57:32 UTC

large document with multiple fields performance

Hello,

 

I am new to Lucene and am building an application that requires documents
with many fields to be searched.

A "project" ID is stored (not_analyzed), and all matching project IDs are
returned so they can be joined with other data from a database.

Which performs better: storing each comment field in a separate document,
together with the project ID and a comment ID, or storing all the
comments for a single project in a single document with multiple
fields?

 

Thanks,

 

Steve Greene


RE: large document with multiple fields performance

Posted by Stephen Greene <SG...@metalseconomics.com>.
Hi Anshum,

Thanks for your insight. I will stick with the 20 fields.
I realized that I had neglected to mention that, in a separate query, I
will search on the primary key and a search term to find out how many
hits come from each field. Is it safe to assume that this will also not
be a problem, and that implementing a custom hit collector will do the
trick?

Thanks again,

Steve 
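
For illustration, the per-field count asked about above could be sketched
as follows. This is a minimal plain-Java sketch, not Lucene code: the class
FieldHitCounter, its methods, and the sample comments are all hypothetical,
and the lower-case/split tokenizer is only a rough stand-in for a real
analyzer. It assumes the fields of the one project selected by primary key
are already in hand.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: counts how often a term occurs in each comment field
// of a single project. No Lucene classes are used.
class FieldHitCounter {

    // Returns field name -> number of occurrences; fields with no hit are omitted.
    static Map<String, Integer> hitsPerField(Map<String, String> fields, String term) {
        String needle = term.toLowerCase(Locale.ROOT);
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            int n = 0;
            // Naive tokenization (assumption): lower-case, split on non-alphanumerics.
            for (String token : e.getValue().toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
                if (token.equals(needle)) {
                    n++;
                }
            }
            if (n > 0) {
                counts.put(e.getKey(), n);
            }
        }
        return counts;
    }

    // Invented sample data for one project.
    static Map<String, String> sampleProject() {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("generalComment", "Permit approved; permit fee paid");
        fields.put("workHistoryComment", "Shaft reopened in 2008");
        fields.put("environmentalComment", "Water study required by permit");
        return fields;
    }

    public static void main(String[] args) {
        // prints {generalComment=2, environmentalComment=1}
        System.out.println(hitsPerField(sampleProject(), "permit"));
    }
}
```

Because the primary key narrows the search to a single document, the cost
here grows only with the number of fields in that one document.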

-----Original Message-----
From: Anshum [mailto:anshumg@gmail.com] 
Sent: Tuesday, September 08, 2009 2:08 PM
To: java-user@lucene.apache.org
Subject: Re: large document with multiple fields performance

Hey Steve,

I'd suggest you go with the 20-field (non-normalized) model. I've used
much larger models and they work just fine. There wouldn't be any point
in increasing the complexity.
Hope that clarifies things at least a little :)
--
Anshum Gupta
Naukri Labs!
http://ai-cafe.blogspot.com

The facts expressed here belong to everybody, the opinions to me. The
distinction is yours to draw............


On Tue, Sep 8, 2009 at 6:57 PM, Stephen Greene
<SG...@metalseconomics.com>wrote:

> Hi Anshum,
>
> Thank you for your reply. I have two options I am considering.
> One would be:
> Document {
>        String projectID;
>        String generalComment;
>        String workHistoryComment;
>        String environmentalComment;
>        String claimsComment;
>        ...
> }
>
> The document may contain upwards of 20 comment fields.
>
> The other option would be to normalize the data:
> Document {
>        String projectID;
>        String commentType;
>        String comment;
> }
>
> I only need to return the projectID for each document found, and I have
> implemented a custom Collector to capture it. Then it occurred to me
> that I might be better served by the normalized document model. But I am
> wondering which approach will perform better: possibly returning 20
> documents per matching project, or having to search 20 fields per
> document? (This also has implications for the query: each search term
> must always be searched in all fields, which is somewhat easier in the
> normalized model than building 20 "or" queries.)
>
> Thanks,
>
> Steve
>
> -----Original Message-----
> From: Anshum [mailto:anshumg@gmail.com]
> Sent: Tuesday, September 08, 2009 9:47 AM
> To: java-user@lucene.apache.org
> Subject: Re: large document with multiple fields performance
>
> Hi Stephen,
> Could you clarify the requirement a bit more? Do you intend to have data
> in the index as:
> Document{
>  String Comment;
>  String CommentId;
>  String ProjectId;
> }
>
> How do you intend to index it, in terms of the doc structure? Is there a
> primary key? What would you search on? What would you want to have as
> the result?
> All said and done, it's not really an overhead as long as the number of
> fields is within normal bounds.
>
>
> --
> Anshum Gupta
> Naukri Labs!
> http://ai-cafe.blogspot.com
>
> The facts expressed here belong to everybody, the opinions to me. The
> distinction is yours to draw............
>
>
> On Tue, Sep 8, 2009 at 5:27 PM, Stephen Greene
> <SG...@metalseconomics.com>wrote:
>
> > Hello,
> >
> >
> >
> > I am new to Lucene and am building an application that requires
> > documents with many fields to be searched.
> >
> > A "project" ID is stored (not_analyzed), and all matching project IDs
> > are returned so they can be joined with other data from a database.
> >
> > Which performs better: storing each comment field in a separate
> > document, together with the project ID and a comment ID, or storing all
> > the comments for a single project in a single document with multiple
> > fields?
> >
> >
> >
> > Thanks,
> >
> >
> >
> > Steve Greene
> >
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>



Re: large document with multiple fields performance

Posted by Anshum <an...@gmail.com>.
Hey Steve,

I'd suggest you go with the 20-field (non-normalized) model. I've used much
larger models and they work just fine. There wouldn't be any point in
increasing the complexity.
Hope that clarifies things at least a little :)
--
Anshum Gupta
Naukri Labs!
http://ai-cafe.blogspot.com

The facts expressed here belong to everybody, the opinions to me. The
distinction is yours to draw............



RE: large document with multiple fields performance

Posted by Stephen Greene <SG...@metalseconomics.com>.
Hi Anshum,

Thank you for your reply. I have two options I am considering.
One would be:
Document {
	String projectID;
	String generalComment;
	String workHistoryComment;
	String environmentalComment;
	String claimsComment;
	...
}

The document may contain upwards of 20 comment fields.

The other option would be to normalize the data:
Document {
	String projectID;
	String commentType;
	String comment;
}

I only need to return the projectID for each document found, and I have
implemented a custom Collector to capture it. Then it occurred to me
that I might be better served by the normalized document model. But I am
wondering which approach will perform better: possibly returning 20
documents per matching project, or having to search 20 fields per
document? (This also has implications for the query: each search term
must always be searched in all fields, which is somewhat easier in the
normalized model than building 20 "or" queries.)

Thanks,

Steve
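
To make the two options concrete, here is a minimal plain-Java sketch of
both layouts and of what a search has to do in each. It does not use Lucene:
LayoutSketch, ProjectDoc, CommentDoc, and the sample comments are
hypothetical names and data invented for illustration, and the tokenizer is
a naive stand-in for an analyzer. It says nothing about real index
performance; it only shows that both layouts yield the same set of project
IDs, with the normalized layout needing de-duplication.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Plain-Java model of the two layouts; all names here are hypothetical.
class LayoutSketch {

    // Option 1: one document per project, one entry per comment field.
    static final class ProjectDoc {
        final String projectID;
        final Map<String, String> comments = new LinkedHashMap<>();

        ProjectDoc(String projectID) {
            this.projectID = projectID;
        }

        ProjectDoc with(String field, String text) {
            comments.put(field, text);
            return this;
        }
    }

    // Option 2: one document per comment.
    static final class CommentDoc {
        final String projectID;
        final String commentType;
        final String comment;

        CommentDoc(String projectID, String commentType, String comment) {
            this.projectID = projectID;
            this.commentType = commentType;
            this.comment = comment;
        }
    }

    // Naive tokenization (assumption): lower-case, split on non-alphanumerics.
    static boolean containsTerm(String text, String term) {
        String needle = term.toLowerCase(Locale.ROOT);
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.equals(needle)) {
                return true;
            }
        }
        return false;
    }

    // Option 1: the term is tried against every comment field (the 20 "or" clauses).
    static Set<String> searchWide(List<ProjectDoc> docs, String term) {
        Set<String> ids = new TreeSet<>();
        for (ProjectDoc d : docs) {
            for (String text : d.comments.values()) {
                if (containsTerm(text, term)) {
                    ids.add(d.projectID);
                    break; // one matching field is enough for this project
                }
            }
        }
        return ids;
    }

    // Option 2: a single field is searched, but several documents may match
    // for the same project, so the Set de-duplicates the IDs.
    static Set<String> searchNormalized(List<CommentDoc> docs, String term) {
        Set<String> ids = new TreeSet<>();
        for (CommentDoc d : docs) {
            if (containsTerm(d.comment, term)) {
                ids.add(d.projectID);
            }
        }
        return ids;
    }

    // Converts option 1 documents into option 2 documents.
    static List<CommentDoc> normalize(List<ProjectDoc> docs) {
        List<CommentDoc> out = new ArrayList<>();
        for (ProjectDoc d : docs) {
            for (Map.Entry<String, String> e : d.comments.entrySet()) {
                out.add(new CommentDoc(d.projectID, e.getKey(), e.getValue()));
            }
        }
        return out;
    }

    // Invented sample data.
    static List<ProjectDoc> sampleDocs() {
        List<ProjectDoc> docs = new ArrayList<>();
        docs.add(new ProjectDoc("P1")
                .with("generalComment", "Drilling permit approved")
                .with("environmentalComment", "Permit requires a water study"));
        docs.add(new ProjectDoc("P2").with("workHistoryComment", "Shaft reopened in 2008"));
        docs.add(new ProjectDoc("P3").with("claimsComment", "Claims permit pending"));
        return docs;
    }

    public static void main(String[] args) {
        List<ProjectDoc> wide = sampleDocs();
        System.out.println(searchWide(wide, "permit"));                  // prints [P1, P3]
        System.out.println(searchNormalized(normalize(wide), "permit")); // prints [P1, P3]
    }
}
```

In the sample data, project P1 matches "permit" in two comments, so the
normalized layout sees two matching documents for it while the multi-field
layout sees one.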



Re: large document with multiple fields performance

Posted by Anshum <an...@gmail.com>.
Hi Stephen,
Could you clarify the requirement a bit more? Do you intend to have data in
the index as:
Document{
  String Comment;
  String CommentId;
  String ProjectId;
}

How do you intend to index it, in terms of the doc structure? Is there a
primary key? What would you search on? What would you want to have as the
result?
All said and done, it's not really an overhead as long as the number of
fields is within normal bounds.


--
Anshum Gupta
Naukri Labs!
http://ai-cafe.blogspot.com

The facts expressed here belong to everybody, the opinions to me. The
distinction is yours to draw............

