Posted to java-user@lucene.apache.org by Michael Wechner <mi...@wyona.com> on 2004/07/05 13:37:00 UTC

Re: incrementally indexing a million documents

Nader Henein wrote:

> How are your documents named? Is it alphabetical


alphabetically

> or numerical? Mine were numerical, so I created n directories like so:
> 11, 12, 13, 14, ... 19, 21, 22, 23, ... 99 (you get the idea)



right, but don't I lose all the performance on sorting and could 
instead rebuild the index from scratch ... ?

Thanks

Michi


> and I stored the files into the directories that each belonged to,
> depending on the last two numbers in the file name (you could use file
> size to shuffle the files around as well, i.e. use the 2 rightmost
> numbers in the file size in bytes). At this point you'll have
> shuffled your million docs into 100 directories, and then Lucene can
> spider through each set of directories, indexing let's say 5000 files
> at a time and then deleting them or moving them into another location.
> If you get 100 million files, simply up the precision on the directory
> to a 3 digit setup or a 4 digit setup (once you automate it, sky's the
> limit).
> Hope this helps
>
> Nader Henein
>
>
> Michael Wechner wrote:
>
>> I try to index around a million documents. The problem is
>> that I run out of memory during sorting by uid when I go through
>> the directory recursively.
>>
>> Well, I could add more memory, but this wouldn't really solve my 
>> problem,
>> because at some point I will always run out of memory (e.g. 10 
>> million documents).
>>
>> Is there another approach than sorting by uid?
>>
>> Thanks
>>
>> Michi
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>
>
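
For reference, the directory-bucketing scheme quoted above could be
sketched roughly as follows. This is only an illustration under stated
assumptions, not code from anyone in this thread: the class name
BucketSharder and the methods bucketFor and shard are hypothetical, and
the hash fallback for names without trailing digits (e.g. alphabetical
names) is my own assumption rather than part of the original suggestion.
It only moves files into buckets; the Lucene indexing of each bucket is
left out.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: spreads files over 10^digits bucket directories.
class BucketSharder {

    // Bucket name = last `digits` digits of the file name (extension ignored),
    // left-padded with zeros. Names with no trailing digits fall back to a
    // hash of the name (an assumption, not part of the quoted scheme).
    static String bucketFor(String fileName, int digits) {
        int dot = fileName.lastIndexOf('.');
        String base = dot > 0 ? fileName.substring(0, dot) : fileName;
        int end = base.length();
        int start = end;
        while (start > 0 && end - start < digits
                && Character.isDigit(base.charAt(start - 1))) {
            start--;
        }
        String tail = base.substring(start, end);
        if (tail.isEmpty()) {
            int buckets = (int) Math.pow(10, digits);
            int n = Math.floorMod(fileName.hashCode(), buckets);
            return String.format("%0" + digits + "d", n);
        }
        StringBuilder padded = new StringBuilder(tail);
        while (padded.length() < digits) {
            padded.insert(0, '0');
        }
        return padded.toString();
    }

    // Moves every regular file in `source` into target/<bucket>/ and returns
    // the number of files moved. DirectoryStream iterates lazily, so the full
    // list of file names is never held (or sorted) in memory.
    static int shard(Path source, Path target, int digits) throws IOException {
        int moved = 0;
        try (DirectoryStream<Path> files = Files.newDirectoryStream(source)) {
            for (Path f : files) {
                if (!Files.isRegularFile(f)) {
                    continue;
                }
                String name = f.getFileName().toString();
                Path dir = target.resolve(bucketFor(name, digits));
                Files.createDirectories(dir);
                Files.move(f, dir.resolve(name));
                moved++;
            }
        }
        return moved;
    }
}
```

Calling BucketSharder.shard(source, target, 2) would give up to 100
buckets; passing 3 or 4 raises the precision as suggested in the quote.
Each bucket can then be indexed in batches on its own.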


-- 
Michael Wechner
Wyona Inc.  -   Open Source Content Management   -   Apache Lenya
http://www.wyona.com              http://cocoon.apache.org/lenya/
michael.wechner@wyona.com                        michi@apache.org

