Posted to user@nutch.apache.org by kumarlimbu <ku...@gmail.com> on 2007/11/20 12:44:13 UTC

Is storing 20 fields in a lucene document desirable?

Each of our documents contains a total of 23 fields, and we STORE all
of them in the Lucene index.

We have recently had some performance issues and our analysis has shown the
bottleneck to be Lucene search and retrieval.

We have been thinking about reducing the number of fields per document by
removing unnecessary fields and by merging fields with similar weightings.
Will reducing the number of fields help to optimize performance?

Another issue is that we currently retrieve around 9 fields after we do a
search. Some are long text fields of up to 1000 words. Is retrieving long
fields a large overhead?

We are considering separating the search and retrieval parts so that Lucene
performs the search and MySQL stores the data. We would keep only the INDEXED
fields and the primary key in the Lucene index. After searching, we would
return only 1 field (the primary key) instead of 9 fields. This field would be
used to retrieve the actual information from the MySQL database. Will reducing
the number of fields retrieved from Lucene reduce the response time, or will
using the MySQL database make it worse?
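
A minimal sketch of the split described above, under loud assumptions: an
in-memory dictionary stands in for the Lucene index, Python's built-in
sqlite3 stands in for MySQL, and the table, column, and function names
(docs, search_ids, fetch_rows) are hypothetical. It only illustrates the
two-step flow: the search step returns primary keys, and a single database
query then fetches the full rows in ranked order.

```python
import sqlite3

# Stand-in for the MySQL table that holds the full field data (assumption:
# sqlite3 is used here only so the sketch is self-contained).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, title TEXT, body TEXT)")
db.executemany(
    "INSERT INTO docs VALUES (?, ?, ?)",
    [
        (1, "Lucene basics", "indexing and searching text"),
        (2, "MySQL tuning", "buffer pools and indexes"),
        (3, "Nutch crawl", "crawling and indexing the web"),
    ],
)

# Stand-in for the search index: term -> list of primary keys.
# Only the primary key is kept on the search side.
inverted_index = {}
for doc_id, body in db.execute("SELECT id, body FROM docs ORDER BY id"):
    for term in set(body.split()):
        inverted_index.setdefault(term, []).append(doc_id)


def search_ids(term, limit=10):
    """Search step: return at most `limit` primary keys, nothing else."""
    return inverted_index.get(term, [])[:limit]


def fetch_rows(ids):
    """Retrieval step: one query for all hits, returned in hit order."""
    if not ids:
        return []
    placeholders = ",".join("?" for _ in ids)
    rows = db.execute(
        f"SELECT id, title, body FROM docs WHERE id IN ({placeholders})", ids
    ).fetchall()
    by_id = {row[0]: row for row in rows}
    return [by_id[i] for i in ids if i in by_id]


hits = search_ids("indexing")
results = fetch_rows(hits)
```

Fetching all hits with one IN query rather than one query per hit keeps the
extra database round trips to one per search, which matters when the
concern is added I/O.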

So our main concern is whether retrieving fields usually takes longer than
searching. What does Lucene spend most of its time doing: search or
retrieval? We are also concerned that using MySQL will cause performance
issues because we will be doing I/O for MySQL as well as Lucene. We also add
around 100k documents each day and remove around the same number of
documents. Will this frequent reading and writing have an impact on
performance?
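
One way to answer the search-versus-retrieval question is to time the two
phases separately. The sketch below is only a harness: search and retrieve
are hypothetical stand-ins (a dictionary lookup, not Lucene), and in a real
measurement they would be replaced by the application's own search call and
stored-field loading.

```python
import time


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


# Hypothetical stand-in data: 1000 documents of roughly 1000 words each.
STORE = {doc_id: "word " * 1000 for doc_id in range(1000)}


def search(query, limit=10):
    """Stand-in search step: returns document ids only."""
    return [doc_id for doc_id in STORE if doc_id % 7 == 0][:limit]


def retrieve(ids):
    """Stand-in retrieval step: loads the long stored text for each id."""
    return [STORE[doc_id] for doc_id in ids]


ids, search_secs = timed(search, "example query")
docs, retrieve_secs = timed(retrieve, ids)
print(f"search: {search_secs:.6f}s, retrieval: {retrieve_secs:.6f}s")
```

Timing the phases over many representative queries, rather than a single
one, gives a more reliable picture of where the time goes.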

Our current index is around 8 GB and contains around 3M documents.

If there are any suggestions or questions, please reply.

Thank you all!

Regards,
Kumar



Re: Is storing 20 fields in a lucene document desirable?

Posted by Sagar Naik <sa...@visvo.com>.
Hey,

 - I am not sure whether you have tried this, but you could try splitting
the index across 3-4 machines.
 - Reducing the fields would definitely help (as you have said).
 - What kind of queries do you fire? Are they BooleanQueries or more
exhaustive queries like RangeQuery?
 - I am not sure whether you did this, but you should not index fields on
which you won't fire queries.
 - Retrieval: Let's say you have asked for the top 10 docs. Lucene will
fetch the fields for only these top 10 docs, not for all the docs that
matched the query.
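
The retrieval point above can be sketched as follows. This is not Lucene
code: load_stored_fields and top_docs are hypothetical names, and the
scores are made up. It only shows that when 1000 documents match, stored
fields are loaded for the 10 best hits and for nothing else.

```python
import heapq

loads = 0  # counts how many documents had their stored fields loaded


def load_stored_fields(doc_id):
    """Hypothetical stand-in for reading a document's stored fields."""
    global loads
    loads += 1
    return {"id": doc_id, "body": f"long text of document {doc_id}"}


def top_docs(scored_hits, n=10):
    """Pick the n best-scoring hits, then load fields for those only."""
    best = heapq.nlargest(n, scored_hits, key=lambda hit: hit[1])
    return [load_stored_fields(doc_id) for doc_id, _score in best]


# 1000 matching documents with made-up scores.
hits = [(doc_id, (doc_id * 37) % 101) for doc_id in range(1000)]
top = top_docs(hits, n=10)
```

Under this model the retrieval cost grows with the number of results
requested and the size of their stored fields, not with the number of
matching documents.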


