
Posted to dev@nutch.apache.org by Alan Tanaman <al...@idna-solutions.com> on 2007/01/02 13:57:03 UTC

Creating Lucene Compound Index

Currently Nutch creates a Lucene multifile index, and makes sure any
existing compound index is converted to multifile by calling
IndexWriter.setUseCompoundFile(false).

This is done whenever an IndexWriter is opened in the following methods:

org.apache.nutch.indexer.Indexer.getRecordWriter
org.apache.nutch.indexer.IndexSorter.sort
org.apache.nutch.indexer.IndexMerger.merge

Is there a technical constraint that requires Nutch to enforce the
multifile format (or prevent compound), rather than allowing the format
to be set by a property?

Does anyone object to, or support, a patch to make this configurable?

Best regards,

Alan

_________________________

Alan Tanaman

iDNA Solutions

RE: Creating Lucene Compound Index

Posted by Alan Tanaman <al...@idna-solutions.com>.

Thanks for your feedback, we'll get to work on a patch in a day or two.
The config comment will be clear in stating the tradeoff.

Best regards,
Alan
_________________________
Alan Tanaman
iDNA Solutions

-----Original Message-----
From: Andrzej Bialecki [mailto:ab@getopt.org] 
Sent: 02 January 2007 14:06
To: nutch-dev@lucene.apache.org
Subject: Re: Creating Lucene Compound Index

Re: Creating Lucene Compound Index

Posted by Andrzej Bialecki <ab...@getopt.org>.

Alan Tanaman wrote:
> Agree about the performance degradation (estimated at 5-10% by Gospodnetic
> and Hatcher), which only affects the indexing time, not the search time, but
> we would put this as a clear caveat in the conf file.
>   

Note: this is just for the time-related degradation. Temporary space 
usage is 200% higher for compound indexes ...

> We'd rather the incremental index process be a little slower (our big
> performance problem is on parsing anyway), but that the file system work be
> a little more manageable.
>
> Are there any objections?
>   

I don't object to the idea of having this as an option, defaulting to 
non-compound index, with a clear comment in the config file about this 
tradeoff.
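
A minimal sketch of the option as agreed above: a boolean setting that
defaults to the non-compound (multifile) format. The property name
"indexer.use.compound.file" is hypothetical (the thread does not name one),
and java.util.Properties stands in for the Nutch configuration object so
that the example runs on its own.

```java
import java.util.Properties;

/** Sketch: read a compound-file switch from configuration, defaulting to multifile. */
final class CompoundFileOption {
    // Hypothetical property name; an assumption made for illustration only.
    static final String KEY = "indexer.use.compound.file";

    /** Returns true only when the property is explicitly set to "true". */
    static boolean useCompoundFile(Properties conf) {
        return Boolean.parseBoolean(conf.getProperty(KEY, "false"));
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        // Property unset: keep the current behaviour (multifile index).
        System.out.println("default  -> compound=" + useCompoundFile(conf));

        conf.setProperty(KEY, "true");
        // Property set: the caller would request the compound format.
        System.out.println("override -> compound=" + useCompoundFile(conf));

        // At each place an IndexWriter is opened, the hard-coded call
        //   writer.setUseCompoundFile(false);
        // would then become something like
        //   writer.setUseCompoundFile(useCompoundFile(conf));
    }
}
```

Defaulting to false means existing installations see no change unless they
opt in, which matches the "defaulting to non-compound index" condition.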

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

RE: Creating Lucene Compound Index

Posted by Alan Tanaman <al...@idna-solutions.com>.

True, but we are less than typical.  ;)  Seriously though, we are using
Nutch to conglomerate many small sources in the enterprise of varying shapes
and sizes, meaning many indexes (even when we merge as many together as
possible).  Others using Nutch in the enterprise for internal crawling may
face the same challenges.

We are at the edge of the acceptable limit, as our enterprise
implementations have a somewhat unusual situation:

* Each index has 20 fields (on average - some have 50! - but let's say 20)
* We have up to 30 indexes built on one machine, including helper indexes

Assuming a worst-case situation of 9 unmerged index-segments, we will get:
30 * 9 * (7 + 20) = 7,290 open files

Whereas with compound, it would be:
30 * 9 = 270 open files
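
The same arithmetic as a small runnable sketch. The per-segment figures (7
fixed files plus one file per field for multifile, a single file for
compound) are taken from the numbers above as given; the class and method
names are made up for illustration.

```java
/** Sketch: worst-case open-file estimate using the figures from this message. */
final class OpenFileEstimate {

    /** Multifile: each segment has some fixed files plus one file per field. */
    static int multifile(int indexes, int segmentsPerIndex, int fixedFiles, int fields) {
        return indexes * segmentsPerIndex * (fixedFiles + fields);
    }

    /** Compound: each segment is counted as a single file. */
    static int compound(int indexes, int segmentsPerIndex) {
        return indexes * segmentsPerIndex;
    }

    public static void main(String[] args) {
        int indexes = 30;   // indexes built on one machine
        int segments = 9;   // worst case: unmerged segments per index
        int fixed = 7;      // fixed files per segment, as assumed in the message
        int fields = 20;    // average number of fields per index

        System.out.println("multifile: " + multifile(indexes, segments, fixed, fields)); // 7290
        System.out.println("compound:  " + compound(indexes, segments));                 // 270
    }
}
```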

We are currently considering changing the way we use the indexer so that it
is incremental (adding a few changed files to the existing index instead of
creating a new one). As a result, indexes will not always be optimized, so
each index will contain plenty of segments.

Agree about the performance degradation (estimated at 5-10% by Gospodnetic
and Hatcher), which only affects the indexing time, not the search time, but
we would put this as a clear caveat in the conf file.

We'd rather the incremental index process be a little slower (our big
performance problem is in parsing anyway) if that makes the file system
work a little more manageable.

Are there any objections?

Best regards,
Alan
_________________________
Alan Tanaman
iDNA Solutions

-----Original Message-----
From: Andrzej Bialecki [mailto:ab@getopt.org] 
Sent: 02 January 2007 13:07
To: nutch-dev@lucene.apache.org
Subject: Re: Creating Lucene Compound Index

Re: Creating Lucene Compound Index

Posted by Andrzej Bialecki <ab...@getopt.org>.

Alan Tanaman wrote:
> Currently Nutch creates a Lucene multifile index, and makes sure any
> existing compound index is converted to multifile by using the
> IndexWriter.setUseCompoundFile(false) method.
>
> This is done whenever an IndexWriter is opened in the following methods:
>
> org.apache.nutch.indexer.Indexer.getRecordWriter
> org.apache.nutch.indexer.IndexSorter.sort
> org.apache.nutch.indexer.IndexMerger.merge
>
> Is there a technical constraint as to why Nutch should ensure usage of
> multifile (or prevent compound) and not allow the type to be set by a
> property setting?
>
> Does anyone object to/support a patch to allow this to be configurable?

Multifile indexes are somewhat faster, and require much less temporary
space during indexing. Why would you want to use the compound format
with Nutch? The typical use of Nutch is to work with a single index, or
at most a few indexes, per machine. In that case a regular non-compound
index works better, and there is no danger of running out of file
handles.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com