Posted to dev@lucene.apache.org by Paul Smith <ps...@aconex.com> on 2005/10/03 01:24:16 UTC

Re: Considering lucene

On 01/10/2005, at 6:30 AM, Erik Hatcher wrote:

>
> On Sep 30, 2005, at 1:26 AM, Paul Smith wrote:
>
>
>> This requirement is almost exactly the same as my requirement for
>> the log4j project I work on, where I wanted to index every row of
>> a text log file as its own Document.
>>
>> It works fine, but treating each line as a Document turns out to
>> take a while to index (searching is fantastic, though, I have to
>> say) due to the cost of adding a Document to an index.  I don't
>> think Lucene is currently tuned (or tunable) for that level of
>> Document granularity, so it will depend on how timely you need
>> the indexing to be.
>>
>
> There are several tunable indexing parameters that can help with  
> batch indexing.  By default it is mostly tuned for incremental  
> indexing, but for rapid batch indexing you may need to tune it to  
> merge less often.

Yep, mergeFactor et al.  We currently have it at 1000 (with 8
concurrent threads creating Project-based indices, so that could be
8000 open files during search, unless I'm mistaken), and we've also
increased the value of maxBufferedDocs as per standard practice.
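
For reference, the tuning described above looks something like this.
This is only a sketch against the IndexWriter setters of this era
(earlier releases exposed public fields instead); the index path and
the maxBufferedDocs value are placeholders, not our actual settings:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;

    public class BatchTuning {
      public static void main(String[] args) throws Exception {
        // true = create a new index at this (placeholder) path
        IndexWriter writer = new IndexWriter("/tmp/log-index",
            new StandardAnalyzer(), true);
        writer.setMergeFactor(1000);      // merge segments far less often
        writer.setMaxBufferedDocs(10000); // buffer more docs in RAM first
        // ... addDocument() calls for the batch would go here ...
        writer.close();
      }
    }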
>
>
>>
>> I was hoping (of course it's a big ask) to be able to index a
>> million rows of relatively short lines of text (as log files tend
>> to be) in "a few moments" (no more than a minute), but even with
>> pretty grunty hardware you run up against the bottleneck of the
>> tokenization process (the StandardAnalyzer is not optimal at all
>> in this case because of the way it 'signals' EOF with an exception).
>>
>
> Signals EOF with an exception?  I'm not following that.  Where does  
> that occur?
>

See our recent YourKit "sampling" profile export here:

http://people.apache.org/~psmith/For%20Lucene%20list/IOExceptionProfiling.html

This is a full production test run over 5 hours indexing 6.5 million
records (approx. 30 fields each) on dual P4 Xeon servers with 10K
RPM SCSI disks.  You'll note that a good chunk (35%) of the indexing
thread's time is spent in 2 methods of the
StandardTokenizerTokenManager.  When you look at the source code for
these 2 methods you will see that they rely on FastCharStream
throwing an IOException to 'flag' EOF:

     if (charsRead == -1)
       throw new IOException("read past eof");

(line 72-ish)
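
To make the alternative concrete, here is a hypothetical sketch (not
the actual FastCharStream code; the class name and buffer size are
invented) of reporting EOF by value instead of by exception, so the
hot tokenization loop never pays for a throw/catch:

    import java.io.IOException;
    import java.io.Reader;

    // Sketch only: a buffered char stream that reports EOF with a
    // -1 sentinel instead of throwing an IOException.
    final class SentinelCharStream {
      private final Reader input;
      private final char[] buffer = new char[2048];
      private int bufferLength;   // number of valid chars in buffer
      private int bufferPosition; // next char to return

      SentinelCharStream(Reader input) { this.input = input; }

      /** Returns the next char, or -1 at end of stream. */
      int readChar() throws IOException {
        if (bufferPosition >= bufferLength) {
          int n = input.read(buffer, 0, buffer.length);
          if (n == -1) return -1; // EOF as a value, not an exception
          bufferLength = n;
          bufferPosition = 0;
        }
        return buffer[bufferPosition++];
      }
    }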

Of course, we _could_ always write our own analyzer, but it would be
really nice if the out-of-the-box one were even better.
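
For example, a minimal custom analyzer along these lines (a sketch
only; the class name is made up, and it assumes plain whitespace
tokenization is acceptable for log lines) would avoid the
JavaCC-generated machinery entirely:

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.WhitespaceTokenizer;

    // Sketch: a lighter-weight analyzer for log lines.
    public class LogLineAnalyzer extends Analyzer {
      public TokenStream tokenStream(String fieldName, Reader reader) {
        // CharTokenizer subclasses detect EOF via the -1 return of
        // Reader.read(), not a thrown IOException.
        return new LowerCaseFilter(new WhitespaceTokenizer(reader));
      }
    }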


>
>> There was someone (apologies, I've forgotten his name; I blame the
>> holiday I just came back from) who could take a relatively small
>> file, such as an XML file, and very quickly index it for
>> on-the-fly XPath-like queries using Lucene, which apparently works
>> very well, but I'm not sure it scales to massive documents such as
>> log files (and your requirements).
>>
>
> Wolfgang Hoschek and the NUX project may be what you're referring
> to.  He contributed the MemoryIndex feature found under
> contrib/memory.  I'm not sure that feature is a good fit for the
> log file or indexing files line-by-line though.

Yes, Wolfgang's code is very cool, but it would only work on small texts.
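
For anyone curious, using the contrib MemoryIndex looks roughly like
this (a sketch; the field name, sample text, and query are invented,
and it assumes the contrib/memory jar is on the classpath):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.memory.MemoryIndex;
    import org.apache.lucene.queryParser.QueryParser;

    public class MemoryIndexSketch {
      public static void main(String[] args) throws Exception {
        // Index one small text entirely in RAM...
        MemoryIndex index = new MemoryIndex();
        index.addField("line", "ERROR connection refused to host db01",
            new StandardAnalyzer());
        // ...then match a single query against it; search() returns
        // a relevance score, > 0.0f meaning the query matched.
        QueryParser parser = new QueryParser("line", new StandardAnalyzer());
        float score = index.search(parser.parse("+error +db01"));
        System.out.println(score > 0.0f ? "match" : "no match");
      }
    }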

cheers,

Paul Smith

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org