Posted to java-user@lucene.apache.org by Michael Barry <mb...@cos.com> on 2003/02/24 20:00:32 UTC

Indexing Tips and Hints

All,
   I'm in need of some pointers, hints or tips on indexing large collections
of data. I know I saw some tips on this list before, but when I tried
searching the list, I came up blank.
   I have a large collection of XML files (336,000 files, around 5K apiece)
that I'm indexing, and it's taking quite a bit of time (27 hours). I've
played around with the mergeFactor, RAMDirectories and multiple threads (X
number of threads, each indexing a subset of the data, and then merging the
indexes at the end) but I cannot seem to bring the time down. I'm probably
not doing these things properly, but from what I read I believe I am. Maybe
this is the best I can do with this data, but I would be really grateful to
hear how others have tackled this same issue.
   As always, pointers to places in the mailing list archive or other places
would be appreciated.

Thanks, Mike.

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: Indexing Tips and Hints

Posted by Doug Cutting <cu...@lucene.com>.
These sorts of tricks can help somewhat if index i/o is really your
bottleneck.  Are you convinced that it is?  When i/o is a bottleneck, the
CPU typically spends a large portion of its time idle.  Do you see this?

From your description (indexing ~300k 5k documents takes over 24 hours)
I would be very surprised if index i/o is your bottleneck.  Rather, I
would suspect the XML parsing or somesuch.
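
One rough way to check the "is the CPU idle?" question from inside the JVM is to compare the CPU time a thread consumed with the wall-clock time that passed. The sketch below is only an illustration under some assumptions: the class name CpuShare is made up, it measures the calling thread only, and it relies on java.lang.management, which exists on current JVMs but is not part of Lucene. A fraction near 0 suggests the step was mostly waiting (for example on i/o); a fraction near 1 suggests it was computing (for example parsing).

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// Hypothetical helper, not part of Lucene.
final class CpuShare {
    /** Runs the task on the calling thread and returns CPU time divided by wall-clock time. */
    static double cpuFraction(Runnable task) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        long cpuStart = threads.getCurrentThreadCpuTime(); // nanoseconds; -1 if measurement is disabled
        long wallStart = System.nanoTime();
        task.run();
        long wall = System.nanoTime() - wallStart;
        long cpu = threads.getCurrentThreadCpuTime() - cpuStart;
        return wall <= 0 ? 0.0 : (double) cpu / wall;
    }

    public static void main(String[] args) {
        // Stand-in for an i/o-bound step: the thread waits instead of computing.
        double waiting = cpuFraction(() -> {
            try {
                Thread.sleep(200);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        // Stand-in for a CPU-bound step such as parsing.
        double computing = cpuFraction(() -> {
            long sum = 0;
            for (int i = 0; i < 50_000_000; i++) sum += i % 7;
            if (sum == 42) System.out.println(sum); // uses the result so the loop is not discarded
        });
        System.out.printf("waiting: %.2f, computing: %.2f%n", waiting, computing);
    }
}
```

In a real indexing run the task would be "parse and add one batch of documents"; watching the operating system's CPU monitor while indexing gives the same information with no code at all.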

In general, Lucene's default settings are designed to give good 
performance.  If pumping up some parameter made a huge performance 
improvement with little other impact then it would be pumped up by 
default.  Increasing the mergeFactor speeds things somewhat, but it also 
causes more file handles to be used.
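
The trade-off described above can be made concrete with a small idealized model. This is a sketch under stated assumptions, not Lucene's actual merge code: it assumes every document starts as its own segment and that mergeFactor equal-sized segments are always merged into one larger segment. Under that assumption the number of segments left after N documents is the digit sum of N written in base mergeFactor, and since each segment is stored as one or more files, more segments means more file handles. The class and method names are hypothetical.

```java
// Idealized model of logarithmic merging; names are illustrative only.
final class MergeFactorModel {
    /** Segments remaining after docCount documents: the digit sum of docCount in base mergeFactor. */
    static int segmentsAfter(long docCount, int mergeFactor) {
        if (mergeFactor < 2) throw new IllegalArgumentException("mergeFactor must be >= 2");
        int segments = 0;
        for (long n = docCount; n > 0; n /= mergeFactor) {
            segments += (int) (n % mergeFactor);
        }
        return segments;
    }

    /** Total number of documents rewritten by all merges while adding docCount documents. */
    static long docsRewritten(long docCount, int mergeFactor) {
        if (mergeFactor < 2) throw new IllegalArgumentException("mergeFactor must be >= 2");
        long rewritten = 0;
        // A merge producing a segment of 'size' documents happens docCount / size times.
        for (long size = mergeFactor; size <= docCount; size *= mergeFactor) {
            rewritten += (docCount / size) * size;
        }
        return rewritten;
    }

    public static void main(String[] args) {
        for (int mergeFactor : new int[] {10, 50}) {
            System.out.println("mergeFactor " + mergeFactor
                    + ": segments=" + segmentsAfter(336_000, mergeFactor)
                    + ", docs rewritten=" + docsRewritten(336_000, mergeFactor));
        }
    }
}
```

For 336,000 documents the model gives 12 segments and 1,638,000 rewritten documents at mergeFactor 10, against 56 segments and 921,000 rewritten documents at mergeFactor 50: less merge work, but more segments open at once.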

When Karl talks of "flushing" a RAM-based index to disk, I suspect he's 
using IndexWriter.addIndexes().  Reading his message, I'd be surprised 
if his performance is really much better than it would be if he just set 
mergeFactor to 50 and then optimized the index just once at the end, and 
that is a lot less work.
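
The general "write to RAM, then flush to disk" pattern that Mike asks about can be illustrated without Lucene. The sketch below is a stdlib-only stand-in and makes no claim about Lucene's API: BatchedWriter and its methods are hypothetical names, the in-memory list plays the role of the RAM-based index, and the append to a file plays the role of the flush (which, per the guess above, would be IndexWriter.addIndexes() in Lucene itself).

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in: buffers entries in memory, appends them to disk in batches.
final class BatchedWriter implements AutoCloseable {
    private final Path target;
    private final int batchSize;
    private final List<String> buffer = new ArrayList<>();
    private int flushes = 0;

    BatchedWriter(Path target, int batchSize) {
        if (batchSize < 1) throw new IllegalArgumentException("batchSize must be >= 1");
        this.target = target;
        this.batchSize = batchSize;
    }

    /** Adds one entry to the in-memory buffer and flushes once the batch is full. */
    void add(String entry) throws IOException {
        buffer.add(entry);
        if (buffer.size() >= batchSize) flush();
    }

    /** Appends all buffered entries to the target file in one call, then empties the buffer. */
    void flush() throws IOException {
        if (buffer.isEmpty()) return;
        Files.write(target, buffer, StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        buffer.clear();
        flushes++;
    }

    int flushCount() {
        return flushes;
    }

    /** Flushes whatever is still buffered. */
    @Override
    public void close() throws IOException {
        flush();
    }
}
```

Note that the buffer is reused after each flush rather than recreated, and that a larger batchSize trades memory for fewer disk writes, which is the same kind of trade-off a higher mergeFactor makes.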

Doug



Re: Indexing Tips and Hints

Posted by Michael Barry <mb...@cos.com>.
Thanks for all the info. I've been working on streamlining my indexing and
I've finally found the message from last year that intrigued me

(http://nagoya.apache.org/eyebrowse/ReadMsg?listName=lucene-user@jakarta.apache.org&msgNo=1220).

In that message, Karl Øie suggests

1. use a ramdir, and multiple fsdirs
2. merge the fsdirs into a single fsdir
3. use threads

(Of course he provides more details.)
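
The shape of the three steps above (several workers each building their own partial index, followed by one merge) can be sketched with plain Java collections. This is a toy stand-in under stated assumptions, not Karl's code and not Lucene: PartitionedIndexer is a hypothetical name, a sorted map from term to document ids stands in for each per-thread index, and the final loop stands in for merging the partial indexes into one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Toy illustration only: in-memory maps play the role of per-thread indexes.
final class PartitionedIndexer {
    /** Builds a term -> document-id index for the documents in positions [from, to). */
    static Map<String, TreeSet<Integer>> indexSlice(List<String> docs, int from, int to) {
        Map<String, TreeSet<Integer>> partial = new TreeMap<>();
        for (int id = from; id < to; id++) {
            for (String term : docs.get(id).toLowerCase().split("\\W+")) {
                if (!term.isEmpty()) {
                    partial.computeIfAbsent(term, t -> new TreeSet<>()).add(id);
                }
            }
        }
        return partial;
    }

    /** Indexes the slices on separate threads, then merges the partial indexes on the calling thread. */
    static Map<String, TreeSet<Integer>> index(List<String> docs, int threads) throws Exception {
        if (threads < 1) throw new IllegalArgumentException("threads must be >= 1");
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Map<String, TreeSet<Integer>>>> parts = new ArrayList<>();
            int sliceSize = (docs.size() + threads - 1) / threads; // ceiling division
            for (int from = 0; from < docs.size(); from += sliceSize) {
                final int start = from;
                final int end = Math.min(docs.size(), from + sliceSize);
                parts.add(pool.submit(() -> indexSlice(docs, start, end)));
            }
            Map<String, TreeSet<Integer>> merged = new TreeMap<>();
            for (Future<Map<String, TreeSet<Integer>>> part : parts) {
                part.get().forEach((term, ids) ->
                        merged.computeIfAbsent(term, t -> new TreeSet<>()).addAll(ids));
            }
            return merged;
        } finally {
            pool.shutdown();
        }
    }
}
```

The merge step runs on a single thread here, so extra worker threads only pay off when building the partial indexes (parsing included) dominates the total time.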

I have a question concerning RAMDirectories - is there any benefit to using
them over setting the mergeFactor higher? Also, I notice a lot of advice to
use RAMDirectories but not much verbiage on how to use them effectively.

In the above msg from Karl, he suggests writing to a RAMDirectory and then at
some point flushing the RAMDirectory to an FSDirectory. Does anyone have any
code to illuminate that? It's the "flushing" part that's getting me. Is
flushing just calling list() on the RAMDirectory and then deleteFile() on
each one? Originally I was just creating a new RAMDirectory each time I
needed one (not the best approach but it does work).

I know I should spend time profiling the code to see exactly where the
bottlenecks occur, and I will do that, but I'd also like to get a good handle
on the multiple ways to index.

Thanks for your time, Mike.



Re: Indexing Tips and Hints

Posted by Tatu Saloranta <ta...@hypermall.net>.
On Tuesday 25 February 2003 03:48, Andrzej Bialecki wrote:
> petite_abeille wrote:
...
> > Multivalent's phelps.io.BufferedRandomAccessFile... and I'm happy to
> > report that... it doesn't seem to make a shred of difference in my
> > case... but as always YMMV.
>
> This is strange, or at least counter-intuitive - if you buffer larger
> parts of data in RAM than the standard implementation does, it should
> definitely be faster... Let's wait and see what Terry comes up with.
>
> BTW, how large were the indexes you used for testing? Also, it could be that
> the indexing process is bound by some other bottleneck, and buffering
> helps only when searching an already existing index.

Perhaps it also depends on the platform -- on Linux (for example), all smallish
files are very likely to be kept in memory if accessed often. This is because
all non-allocated RAM is used for disk buffers automatically.
There is still syscall overhead for reading, but compared to actual disk reads
it will be much faster.

-+ Tatu +-





Re: Indexing Tips and Hints

Posted by petite_abeille <pe...@mac.com>.
On Tuesday, Feb 25, 2003, at 11:48 Europe/Zurich, Andrzej Bialecki 
wrote:

> This is strange, or at least counter-intuitive - if you buffer larger 
> parts of data in RAM than the standard implementation does, it should 
> definitely be faster... Let's wait and see what Terry comes up with.
>
> BTW, how large were the indexes you used for testing?

A small testing set: around 100 MB.

> Also, it could be that the indexing process is bound by some other 
> bottleneck,

Most definitively.

>  and buffering helps only when searching an already existing index.

Ooops... forgot to mention that the purpose of my testing was to test 
searching... I don't mind indexing speed that much... in any case... 
more generally I wanted to see if a buffered random access file would 
help in my peculiar situation... but no noticeable differences in my 
case one way or another... on the other hand... that could be just me 
as there is much more than straightforward Lucene indexing/searching 
going on. Let that not discourage you :-) In any case, Lucene itself is 
pretty speedy overall. The only bottleneck is index merging in my 
experience.

Cheers,

PA.




Re: Indexing Tips and Hints

Posted by Andrzej Bialecki <ab...@getopt.org>.
petite_abeille wrote:
> 
> On Tuesday, Feb 25, 2003, at 09:43 Europe/Zurich, Andrzej Bialecki wrote:
> 
>> No, I'm not - this is clearly stated in the class javadoc. I meant to
>> try it out in my application, but haven't got to it yet - I need to
>> address the base functionality first, not performance; so, I don't
>> have the modified FSDirectory yet... The class is taken from the
>> Multivalent browser, and is subject to a BSD-equivalent license - which
>> means you can use it for whatever purpose, and if it turns out to be
>> useful, it can be included in the Lucene distribution.
>
>
> "Ask and ye shall get a random piece of somebody else's mind"... and it
> just so happens that recently I did some (not so rigorous) testing using
> Multivalent's phelps.io.BufferedRandomAccessFile... and I'm happy to
> report that... it doesn't seem to make a shred of difference in my
> case... but as always YMMV.

This is strange, or at least counter-intuitive - if you buffer larger 
parts of data in RAM than the standard implementation does, it should 
definitely be faster... Let's wait and see what Terry comes up with.

BTW, how large were the indexes you used for testing? Also, it could be that
the indexing process is bound by some other bottleneck, and buffering
helps only when searching an already existing index.

-- 
Best regards,
Andrzej Bialecki

-------------------------------------------------
Software Architect, System Integration Specialist
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)





Re: Indexing Tips and Hints

Posted by petite_abeille <pe...@mac.com>.
On Tuesday, Feb 25, 2003, at 09:43 Europe/Zurich, Andrzej Bialecki 
wrote:

> No, I'm not - this is clearly stated in the class javadoc. I meant to
> try it out in my application, but haven't got to it yet - I need to
> address the base functionality first, not performance; so, I don't
> have the modified FSDirectory yet... The class is taken from the
> Multivalent browser, and is subject to a BSD-equivalent license - which
> means you can use it for whatever purpose, and if it turns out to be
> useful, it can be included in the Lucene distribution.

"Ask and ye shall get a random piece of somebody else's mind"... and it
just so happens that recently I did some (not so rigorous) testing using
Multivalent's phelps.io.BufferedRandomAccessFile... and I'm happy to
report that... it doesn't seem to make a shred of difference in my
case... but as always YMMV.

Cheers,

PA.




Re: Indexing Tips and Hints

Posted by Andrzej Bialecki <ab...@getopt.org>.
Terry Steichen wrote:
> Hi Andrzej,
> 
> Thanks for the code.  I'll try it as soon as I have time.  If you had a copy
> of the modified FSDirectory implementation you could also share, that would
> make testing it a bit quicker and easier.  BTW, when you said it "supposedly
> increases I/O", I gather that you are not the author?


No, I'm not - this is clearly stated in the class javadoc. I meant to
try it out in my application, but haven't got to it yet - I need to
address the base functionality first, not performance; so, I don't have
the modified FSDirectory yet... The class is taken from the Multivalent
browser, and is subject to a BSD-equivalent license - which means you can
use it for whatever purpose, and if it turns out to be useful, it can be
included in the Lucene distribution.

-- 
Best regards,
Andrzej Bialecki

-------------------------------------------------
Software Architect, System Integration Specialist
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)





Re: Indexing Tips and Hints

Posted by Andrzej Bialecki <ab...@getopt.org>.
Terry Steichen wrote:
> Hi Andrzej,
> 
> Thanks for the code.  I'll try it as soon as I have time.  If you had a copy
> of the modified FSDirectory implementation you could also share, that would
> make testing it a bit quicker and easier.

"Ask and ye shall receive..." :-) Be aware, though, that I tested it 
only with small doc. collections that I use for functionality testing... 
Everything appears to work as expected, but my test collection is just 
~100 documents, so the searching is blazingly fast no matter what I do.. :-)

-- 
Best regards,
Andrzej Bialecki

-------------------------------------------------
Software Architect, System Integration Specialist
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)


Re: Indexing Tips and Hints

Posted by Terry Steichen <te...@net-frame.com>.
Hi Andrzej,

Thanks for the code.  I'll try it as soon as I have time.  If you had a copy
of the modified FSDirectory implementation you could also share, that would
make testing it a bit quicker and easier.  BTW, when you said it "supposedly
increases I/O", I gather that you are not the author?

Regards,

Terry

----- Original Message -----
From: "Andrzej Bialecki" <ab...@getopt.org>
To: "Lucene Users List" <lu...@jakarta.apache.org>
Sent: Monday, February 24, 2003 3:59 PM
Subject: Re: Indexing Tips and Hints


> Hello,
>
> Since you are trying this anyway, and looking for ways to improve
> indexing times... Could you perhaps try to replace the use of
> java.io.RandomAccessFile in the FSDirectory implementation with the
> attached implementation? It supposedly increases I/O throughput by
> orders of magnitude, by using partial buffering.
>


Re: Indexing Tips and Hints

Posted by Doug Cutting <cu...@lucene.com>.
I doubt this will make Lucene much faster, since Lucene already 
implements buffering in its InputStream and OutputStream classes.  So 
Lucene already has this optimization built-in.

Doug

Andrzej Bialecki wrote:
> Hello,
> 
> Since you are trying this anyway, and looking for ways to improve
> indexing times... Could you perhaps try to replace the use of
> java.io.RandomAccessFile in the FSDirectory implementation with the
> attached implementation? It supposedly increases I/O throughput by
> orders of magnitude, by using partial buffering.
> 


Re: Indexing Tips and Hints

Posted by Andrzej Bialecki <ab...@getopt.org>.
Hello,

Since you are trying this anyway, and looking for ways to improve
indexing times... Could you perhaps try to replace the use of
java.io.RandomAccessFile in the FSDirectory implementation with the
attached implementation? It supposedly increases I/O throughput by
orders of magnitude, by using partial buffering.
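
For readers who have not seen the idea, "partial buffering" on top of java.io.RandomAccessFile looks roughly like the sketch below. This is not the attached Multivalent class and not Lucene code: BufferedRandomReader and its methods are hypothetical names, and it handles reads only. It keeps one block of the file in memory and goes back to the underlying file only when the requested position falls outside that block. As noted elsewhere in this thread, Lucene already buffers in its own InputStream and OutputStream classes, so a second layer like this may not help.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical read-only sketch of partial buffering; not the Multivalent implementation.
final class BufferedRandomReader implements AutoCloseable {
    private final RandomAccessFile file;
    private final byte[] buffer;
    private long bufferStart = 0; // file offset of buffer[0]
    private int bufferLength = 0; // number of valid bytes in the buffer
    private long position = 0;    // next file offset to read
    private long diskReads = 0;   // calls that went to the underlying file

    BufferedRandomReader(String path, int bufferSize) throws IOException {
        if (bufferSize < 1) throw new IllegalArgumentException("bufferSize must be >= 1");
        this.file = new RandomAccessFile(path, "r");
        this.buffer = new byte[bufferSize];
    }

    /** Moves the read position; the file itself is only touched on the next buffer miss. */
    void seek(long newPosition) {
        position = newPosition;
    }

    /** Returns the next byte as 0..255, or -1 at end of file. */
    int read() throws IOException {
        if (position < bufferStart || position >= bufferStart + bufferLength) {
            file.seek(position);
            int n = file.read(buffer, 0, buffer.length);
            diskReads++;
            if (n <= 0) {
                bufferLength = 0;
                return -1;
            }
            bufferStart = position;
            bufferLength = n;
        }
        return buffer[(int) (position++ - bufferStart)] & 0xFF;
    }

    long diskReads() {
        return diskReads;
    }

    @Override
    public void close() throws IOException {
        file.close();
    }
}
```

Reading a file sequentially through this class issues about one underlying read per buffer-full instead of one per byte; random seeks that land outside the current block still cost a read each.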

Terry Steichen wrote:
> Mike,
> 
> By way of comparison, I've got a collection of about 50,000 XML files, each
> of which averages about 8K.  It takes about 1.25 hours to index (on a 1.8GHz
> machine).  I use basically the standard configuration (mergeFactor, etc.)
> and I've got about 30 fields per document.  I add about 200 new ones per
> day.  I don't recall how long it takes to index the 200 (I do it
> through a background task), but it takes a couple of minutes to merge the
> new 200 document index with the master index.
> 
> HTH,
> 
> Terry
> 

-- 
Best regards,
Andrzej Bialecki

-------------------------------------------------
Software Architect, System Integration Specialist
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)


Re: Indexing Tips and Hints

Posted by Terry Steichen <te...@net-frame.com>.
Mike,

By way of comparison, I've got a collection of about 50,000 XML files, each
of which averages about 8K.  It takes about 1.25 hours to index (on a 1.8GHz
machine).  I use basically the standard configuration (mergeFactor, etc.)
and I've got about 30 fields per document.  I add about 200 new ones per
day.  I don't recall how long it takes to index the 200 (I do it
through a background task), but it takes a couple of minutes to merge the
new 200 document index with the master index.

HTH,

Terry
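
To put the two sets of figures in this thread side by side, the arithmetic can be written out as below. The numbers are the ones quoted in the messages (336,000 files of about 5K in 27 hours; 50,000 files of about 8K in 1.25 hours); the class and method names are made up for the illustration, and the two machines are not known to be comparable, so the comparison is only indicative.

```java
// Back-of-the-envelope throughput comparison using the figures quoted in this thread.
final class IndexingRate {
    /** Documents indexed per second. */
    static double docsPerSecond(long docs, double hours) {
        return docs / (hours * 3600.0);
    }

    /** Kilobytes of source data indexed per second. */
    static double kbPerSecond(long docs, double kbPerDoc, double hours) {
        return docs * kbPerDoc / (hours * 3600.0);
    }

    public static void main(String[] args) {
        System.out.printf("Mike:  %.1f docs/s, %.0f KB/s%n",
                docsPerSecond(336_000, 27), kbPerSecond(336_000, 5, 27));
        System.out.printf("Terry: %.1f docs/s, %.0f KB/s%n",
                docsPerSecond(50_000, 1.25), kbPerSecond(50_000, 8, 1.25));
    }
}
```

That works out to roughly 3.5 documents per second in the first case against roughly 11 in the second, so the slower run is spending close to 0.3 seconds on each 5K file.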



Re: Indexing Tips and Hints

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Things to consider:
- disk speed and whether it is busy satisfying other processes'
requests
- CPU speed
- amount of free RAM in the machine and amount of RAM given to your JVM
- the bottleneck - could be a slow XML parser, for instance, profile it
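
A quick way to act on the last point, before reaching for a full profiler, is to time the XML parsing step on its own. The sketch below is only an illustration: ParseTimer is a hypothetical name, it uses the JDK's built-in DOM parser rather than whatever parser the real indexing code uses, and sampleXml() produces placeholder content of about the requested size instead of real data.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical timing helper; substitute the parser and documents actually in use.
final class ParseTimer {
    /** Parses the same XML text 'repeats' times and returns the mean time per parse in milliseconds. */
    static double meanParseMillis(String xml, int repeats) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
        long start = System.nanoTime();
        for (int i = 0; i < repeats; i++) {
            Document doc = builder.parse(new ByteArrayInputStream(bytes));
            if (doc.getDocumentElement() == null) throw new IllegalStateException("empty document");
        }
        return (System.nanoTime() - start) / 1e6 / repeats;
    }

    /** Builds a synthetic XML document of roughly the requested size (placeholder content). */
    static String sampleXml(int approxBytes) {
        StringBuilder sb = new StringBuilder("<doc>");
        while (sb.length() < approxBytes) {
            sb.append("<field name=\"f\">some text to index</field>");
        }
        return sb.append("</doc>").toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.printf("mean parse time: %.2f ms%n", meanParseMillis(sampleXml(5_000), 200));
    }
}
```

The figure to compare against is the per-document budget implied by the original numbers: 27 hours for 336,000 files is about 0.29 seconds per file. If parsing one real file takes a large share of that, the parser is the place to look; if it takes a few milliseconds, the time is going elsewhere.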

I'm about to submit another Lucene article to Onjava.com.  It talks
about the performance of indexing.  I don't know exactly when it will
be published, but when it is I'll send the URL to the list.

Otis



