Posted to dev@lucene.apache.org by Paul Smith <ps...@aconex.com> on 2005/10/03 01:24:16 UTC
Re: Considering lucene
On 01/10/2005, at 6:30 AM, Erik Hatcher wrote:
>
> On Sep 30, 2005, at 1:26 AM, Paul Smith wrote:
>
>
>> This requirement is almost exactly the same as my requirement for
>> the log4j project I work on, where I wanted to be able to index
>> every row in a text log file as its own Document.
>>
>> It works fine, but treating each line as a Document turns out to
>> take a while to index (searching is fantastic though I have to
>> say) due to the cost of adding a Document to an index. I don't
>> think Lucene is currently tuned (or tunable) to that level of
>> Document granularity, so it'll depend on your requirement of
>> timeliness of the indexing.
>>
>
> There are several tunable indexing parameters that can help with
> batch indexing. By default it is mostly tuned for incremental
> indexing, but for rapid batch indexing you may need to tune it to
> merge less often.
Yep, mergeFactor et al. We currently have it at 1000 (with 8
concurrent threads creating Project-based indices, so that could be
8000 open files during search, unless I'm mistaken), and we have also
increased maxBufferedDocs, as per standard practice.
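To see why mergeFactor drives the open-file count, here is a toy model of log-style merging (my own simplification for illustration, not Lucene's actual merge policy or API): every flush adds a "level 0" segment, and whenever mergeFactor segments of the same level accumulate they are merged into one segment of the next level. A high mergeFactor defers merging and so lets many more segments (and their files) stay live at once:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Toy model of log-style segment merging (a hypothetical
 *  simplification, not Lucene's real merge policy). Assumes
 *  mergeFactor >= 2. */
public class MergeModel {

    /** Returns the peak number of live segments seen while
     *  performing the given number of flushes. */
    static int maxSegments(int flushes, int mergeFactor) {
        List<Integer> levels = new ArrayList<>(); // one entry per live segment
        int peak = 0;
        for (int i = 0; i < flushes; i++) {
            levels.add(0);                        // a flush creates a level-0 segment
            peak = Math.max(peak, levels.size());
            boolean merged = true;
            while (merged) {                      // cascade merges upward
                merged = false;
                for (int lvl = 0; lvl <= Collections.max(levels); lvl++) {
                    int count = 0;
                    for (int l : levels) if (l == lvl) count++;
                    if (count >= mergeFactor) {
                        final int target = lvl;
                        levels.removeIf(l -> l == target);
                        levels.add(lvl + 1);      // merge into next level
                        merged = true;
                    }
                }
            }
        }
        return peak;
    }

    public static void main(String[] args) {
        // a low mergeFactor merges eagerly and keeps few segments live;
        // a high one defers merging, so far more segments coexist
        System.out.println(maxSegments(1000, 10));
        System.out.println(maxSegments(1000, 1000));
    }
}
```

In this model, 1000 flushes at mergeFactor 10 never exceed 28 live segments, while mergeFactor 1000 peaks at 1000. Multiply by the number of concurrent index-writing threads, and by the files per segment, for a rough worst-case file-handle estimate.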
>
>
>>
>> I was hoping (of course it's a big ask) to be able to index a
>> million rows of relatively short lines of text (as log files tend
>> to be) in a 'few moments', no more than 1 minute, but even with
>> pretty grunty hardware you run up against the bottleneck of the
>> tokenization process (the StandardAnalyzer is not optimal at all
>> in this case because of the way it 'signals' EOF with an exception).
>>
>
> Signals EOF with an exception? I'm not following that. Where does
> that occur?
>
See our recent YourKit "sampling" profile export here:
http://people.apache.org/~psmith/For%20Lucene%20list/IOExceptionProfiling.html
This is a full production test run over 5 hours, indexing 6.5 million
records (approx. 30 fields) on dual P4 Xeon servers with 10K SCSI
disks. You'll note that a good chunk (35%) of the indexing thread's
time is spent in 2 methods of the StandardTokenizerManager. When you
look at the source code for these 2 methods, you will see that they
rely on FastCharStream's use of an IOException to 'flag' EOF:
    if (charsRead == -1)
        throw new IOException("read past eof");
(line 72-ish)
Of course, we _could_ always write our own analyzer, but it would be
really nice if the out-of-the-box one were even better.
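To make the cost concrete, here is a self-contained sketch contrasting the two EOF conventions (the class and method names are mine for illustration, not Lucene's; only the `throw new IOException("read past eof")` line mirrors FastCharStream). Constructing a Java exception fills in a stack trace by default, so when every log line becomes its own short Document, paying that construction cost once per line/stream adds up:

```java
import java.io.IOException;

/** Sketch of the two EOF-signalling conventions. Hypothetical
 *  names; not Lucene's actual FastCharStream class. */
public class EofStyles {

    // Style used by FastCharStream: a read past the end throws an
    // IOException, so the caller's loop terminates via exception
    // control flow (and pays for a stack-trace fill each time).
    static int readThrowing(char[] buf, int pos) throws IOException {
        if (pos >= buf.length)
            throw new IOException("read past eof");
        return buf[pos];
    }

    // Conventional java.io.Reader style: signal EOF with a -1 return
    // value, which costs nothing beyond the comparison.
    static int readReturning(char[] buf, int pos) {
        if (pos >= buf.length)
            return -1;
        return buf[pos];
    }

    public static void main(String[] args) {
        char[] data = "log line".toCharArray();

        int charsViaException = 0;
        try {
            for (int i = 0; ; i++) {
                readThrowing(data, i);
                charsViaException++;
            }
        } catch (IOException eof) {
            // expected: EOF reached
        }

        int charsViaReturn = 0;
        for (int i = 0; readReturning(data, i) != -1; i++)
            charsViaReturn++;

        // both conventions see the same characters
        System.out.println(charsViaException + " " + charsViaReturn);
    }
}
```

Both loops consume exactly the same input; the difference is purely the per-stream cost of creating and throwing the exception, which is why it shows up so prominently when the streams are one log line long.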
>
>> There was someone (apologies, I've forgotten his name; I blame the
>> holiday I just came back from) who could take a relatively small
>> file, such as an XML file, and very quickly index it for on-the-fly
>> XPath-like queries using Lucene, which apparently works very well,
>> but I'm not sure it scales to massive documents such as log files
>> (and your requirements).
>>
>
> Wolfgang Hoschek and the NUX project may be what you're referring
> to. He contributed the MemoryIndex feature found under contrib/
> memory. I'm not sure that feature is a good fit for the log file
> or indexing files line-by-line though.
Yes, Wolfgang's code is very cool, but would only work on small texts.
cheers,
Paul Smith
---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org