Posted to java-user@lucene.apache.org by Rui Wang <rw...@ebi.ac.uk> on 2011/12/05 18:58:29 UTC

Use multiple lucene indices

Hi All, 

We are planning to use Lucene in our project, but we are not entirely sure about some of the design decisions we have made. Below are the details; any comments/suggestions are more than welcome.

The requirements of the project are below:

1. We have tens of thousands of files, ranging in size from 500 MB to a few terabytes, and the majority of the content in these files will not be accessed frequently.

2. We are planning to keep the less frequently accessed content outside of our database and store it on the file system.

3. We also have code to get the binary positions of this content within the files. Using these binary positions, we can quickly retrieve the content and convert it into our domain objects.
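
To make this concrete, retrieval by binary position could be as simple as a seek plus a bounded read. A minimal sketch, assuming the positions are byte offsets into the data file (the class and method names here are illustrative, not our actual code):

import java.io.RandomAccessFile;

public class ContentReader {

    // Reads the bytes between two offsets out of a large data file.
    static byte[] read(String path, long start, long stop) throws Exception {
        RandomAccessFile file = new RandomAccessFile(path, "r");
        try {
            byte[] buffer = new byte[(int) (stop - start)];
            file.seek(start);
            file.readFully(buffer);
            return buffer;
        } finally {
            file.close();
        }
    }
}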

We think Lucene provides a scalable solution for storing and indexing these binary positions. The idea is that each piece of content in the files will be a document, and each document will have at least an ID field to identify the content and a binary position field containing the start and stop positions of the content. Having done some performance testing, it seems to us that Lucene is well capable of doing this.
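
To illustrate the intended document layout, here is a minimal sketch against the Lucene 3.x-era API; the field names, analyzer choice and path are placeholders rather than our final design:

import java.io.File;
import org.apache.lucene.analysis.KeywordAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class PositionIndexer {

    // One document per piece of content: an exact-match ID plus stored offsets.
    static void addEntry(IndexWriter writer, String contentId, long start, long stop)
            throws Exception {
        Document doc = new Document();
        // Not analyzed, so the ID can be looked up directly with a TermQuery.
        doc.add(new Field("id", contentId, Field.Store.YES, Field.Index.NOT_ANALYZED));
        // Offsets are only read back after a hit, so they are stored but not indexed.
        doc.add(new Field("start", Long.toString(start), Field.Store.YES, Field.Index.NO));
        doc.add(new Field("stop", Long.toString(stop), Field.Store.YES, Field.Index.NO));
        writer.addDocument(doc);
    }

    public static void main(String[] args) throws Exception {
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_35, new KeywordAnalyzer());
        IndexWriter writer = new IndexWriter(FSDirectory.open(new File("/tmp/position-index")), config);
        addEntry(writer, "content-00042", 1024L, 2048L);
        writer.close();
    }
}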

At the moment, we are planning to create one Lucene index per file, so if we have new files to be added to the system, we can simply generate a new index. The problem is to do with searching: this approach means that we need to create a new IndexSearcher every time a file is accessed through our web service. We know that it is rather expensive to open a new IndexSearcher, and we are thinking of using some kind of pooling mechanism. Our questions are:

1. Is this one-index-per-file approach a viable solution? What do you think about pooling IndexSearchers?

2. If we have many IndexSearchers open at the same time, would the memory usage go through the roof? I couldn't find any documentation on how Lucene allocates memory.

Thank you very much for your help. 

Many thanks,
Rui Wang
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Use multiple lucene indices

Posted by liugangc <li...@gmail.com>.
Hi, below are some hints from my experience:
1. If you use one index per file and many IndexSearchers are open at the same time, you may hit a 'too many open files' error; you will have to increase the file-max value of the OS.
2. If these index files see little concurrent access, I think it is reasonable to open a new searcher for every access. Meanwhile, if you use Lucene's sort feature, the field cache may consume a lot of memory, so too many IndexSearchers open at the same time could exhaust all the memory on your machine.
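
For example, a sorted search like the sketch below is what pulls every value of the sort field into the field cache, once per open reader (Lucene 3.x API; the field names are only illustrative):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

public class SortedSearch {

    // Sorting by "start" loads all "start" values for the whole index into
    // the FieldCache; with many open searchers, these caches add up quickly.
    static TopDocs search(IndexSearcher searcher) throws Exception {
        return searcher.search(new TermQuery(new Term("id", "content-00042")),
                               10, new Sort(new SortField("start", SortField.LONG)));
    }
}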


--
gang liu
email: liugangc@gmail.com




Re: Use multiple lucene indices

Posted by "Francisco A. Lozano" <fl...@gmail.com>.
I have that use-case too: lots of indexes, and each request is handled
by only one well-known index. For us it's working very well (but our
indexes are *small* - 1k-10k entries).

What we do is open/close the index reader/writer each time it's
needed, and reuse it if two requests need to access the same index at
the same time... but we don't keep them open: when the last
consumer finishes, we close the resource immediately.

The LFU eviction technique Danil suggests is something we had in mind
too. So far it's working well without it, but it would be nice to see it
working. Are there any examples out there on how to implement it properly?
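
For illustration, one possible shape for such a pool, leaning on IndexReader's built-in reference counting so that an evicted searcher is only really closed once its last consumer releases it. This is a sketch against the Lucene 3.x API with made-up names, not a vetted recipe:

import java.io.File;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.FSDirectory;

public class SearcherPool {

    private static class Entry {
        IndexSearcher searcher;
        long uses;
    }

    private final Map<String, Entry> pool = new HashMap<String, Entry>();
    private final int maxOpen;

    public SearcherPool(int maxOpen) {
        this.maxOpen = maxOpen;
    }

    public synchronized IndexSearcher acquire(String indexPath) throws Exception {
        Entry entry = pool.get(indexPath);
        if (entry == null) {
            if (pool.size() >= maxOpen) {
                evictLeastFrequentlyUsed();
            }
            entry = new Entry();
            entry.searcher = new IndexSearcher(
                    IndexReader.open(FSDirectory.open(new File(indexPath))));
            pool.put(indexPath, entry);
        }
        entry.uses++;
        entry.searcher.getIndexReader().incRef();  // pin for this request
        return entry.searcher;
    }

    public void release(IndexSearcher searcher) throws Exception {
        // The reader actually closes once the pool and all requests have let go.
        searcher.getIndexReader().decRef();
    }

    private void evictLeastFrequentlyUsed() throws Exception {
        String coldest = null;
        long fewest = Long.MAX_VALUE;
        for (Map.Entry<String, Entry> e : pool.entrySet()) {
            if (e.getValue().uses < fewest) {
                fewest = e.getValue().uses;
                coldest = e.getKey();
            }
        }
        pool.remove(coldest).searcher.getIndexReader().decRef();  // drop the pool's ref
    }
}

A caller would do searcher = pool.acquire(path), search, then pool.release(searcher) in a finally block.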

Francisco A. Lozano





Re: Use multiple lucene indices

Posted by Rui Wang <rw...@ebi.ac.uk>.
Hi Danil,

Thank you for answering once again. 

You are right that we always know the file we are searching; the file location is stored in a database.

Having done some testing, it seems to me that using one index per file yields reasonable performance, just as you suggested.

For a 500K-document index, I measured the index load time plus querying and getting the result back: it takes around 350 milliseconds. The memory footprint is around 1.5 MB.

Many thanks,
Rui Wang 


Re: Use multiple lucene indices

Posted by Danil ŢORIN <to...@gmail.com>.
10B documents is a lot of data.

One index per file won't scale: you will not be able to open all the indexes at the
same time (file handle limits, memory limits, etc.), and if you
search through them sequentially, it will take a lot of time.

That is, unless in your use case you always know the file you are searching; in that
case you could open just one index at a time, search it, and close it.
Then one index per file is a good and scalable solution.
(There will be a penalty for a fresh open of the index, but 500K docs/index
should be quite quick to open. You may want to maintain a pool of opened
indexes with LFU eviction, so repeated requests will reuse an already opened
IndexReader, and old/unused IndexReaders will be closed to free the
resources.)
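
For what it's worth, a bare-bones version of that open-search-close pattern, with a rough timing around the cold open, could look like this (Lucene 3.x API; the field names follow the earlier indexing sketch and are assumptions):

import java.io.File;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class OpenSearchClose {

    public static void main(String[] args) throws Exception {
        long t0 = System.currentTimeMillis();
        IndexReader reader = IndexReader.open(FSDirectory.open(new File(args[0])));
        IndexSearcher searcher = new IndexSearcher(reader);
        try {
            TopDocs hits = searcher.search(new TermQuery(new Term("id", args[1])), 1);
            if (hits.totalHits > 0) {
                // The stored byte offsets come back with the document itself.
                System.out.println(searcher.doc(hits.scoreDocs[0].doc).get("start"));
            }
        } finally {
            searcher.close();
            reader.close();
        }
        System.out.println("took " + (System.currentTimeMillis() - t0) + " ms");
    }
}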



ehcache has the ability to keep some entries in memory (say, a few
thousand) and persist the rest of the cache on disk.
So memory usage is not the issue: you could run it with a 64M JVM heap and
let the OS handle the rest.

>

Re: tokenizing text using language analyzer but preserving stopwords if possible

Posted by KARTHIK SHIVAKUMAR <ns...@gmail.com>.
Hi

>> tokenize the original foreign text into words

You need to identify the appropriate analyzer for the foreign language
before indexing...


with regards
karthik





-- 
*N.S.KARTHIK
R.M.S.COLONY
BEHIND BANK OF INDIA
R.M.V 2ND STAGE
BANGALORE
560094*

Re: Improving Lucene Search Performance

Posted by Ian Lea <ia...@gmail.com>.
See http://wiki.apache.org/lucene-java/ImproveSearchingSpeed.  Some of
the tips relate to indexing but most to search time stuff.


--
Ian.




Re: Improving Lucene Search Performance

Posted by Chris Hostetter <ho...@fucit.org>.
: Subject: Improving Lucene Search Performance
: In-Reply-To:
:     <CA...@mail.gmail.com>
: References:
:     <16...@ebi.ac.uk><CAFVhWXieRFqstbGPi+wM1zhZ
:     LL0SMr0uz8+7CUhsHPYdUWQpQA@mail.gmail.com><347A161B-6C7B-4DC3-ACD0-9A804E2
:     DD36C@ebi.ac.uk><CABYvkPR3_14cTaorH-hQ+uYMvvRBMQx5GWzuNAYmE+PYp=fLsg@mail.
:     gmail.com><00...@ebi.ac.uk><A57498EDEC10C64
:     781EA0F7DBA665CEF019DE1@ex2010mb01-2.caci.com>
:  <CA...@mail.gmail.com>

https://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to
an existing message; instead, start a fresh email. Even if you change the
subject line of your email, other mail headers still track which thread
you replied to, and your question is "hidden" in that thread and gets less
attention. It makes following discussions in the mailing list archives
particularly difficult.



-Hoss



Improving Lucene Search Performance

Posted by "Dilshad K. P." <di...@nestgroup.net>.
Hi,
Is there anything to take care of when creating an index to improve Lucene text search speed?

Thanks And Regards
Dilshad K.P

Re: tokenizing text using language analyzer but preserving stopwords if possible

Posted by Avi Rosenschein <ar...@gmail.com>.

You can always use something like StandardAnalyzer for the specific
language, with an empty stopword list (so that no words are treated as
stopwords). A bit trickier might be dealing with punctuation: depending on
the analyzer, you might be able to get it to parse as separate tokens.
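
A minimal sketch of that approach against the Lucene 3.x API; the sample text and field name are placeholders. Note that StandardAnalyzer will still lower-case terms and drop punctuation, which is exactly the caveat above:

import java.io.StringReader;
import java.util.Collections;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class WordByWordTokenizer {

    public static void main(String[] args) throws Exception {
        // Empty stopword set, so no words are dropped from the stream.
        StandardAnalyzer analyzer = new StandardAnalyzer(
                Version.LUCENE_35, Collections.<String>emptySet());
        TokenStream stream = analyzer.tokenStream("body",
                new StringReader("un texte en langue étrangère"));
        CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
        stream.reset();
        while (stream.incrementToken()) {
            // Tokens arrive in original document order; a dictionary lookup
            // and translation would replace this println.
            System.out.println(term.toString());
        }
        stream.end();
        stream.close();
    }
}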

-- Avi


>

tokenizing text using language analyzer but preserving stopwords if possible

Posted by Ilya Zavorin <iz...@caci.com>.
I need to implement a "quick and dirty" or "poor man's" translation of a foreign language document by looking up each word in a dictionary and replacing it with the English translation. So what I need is to tokenize the original foreign text into words and then access each word, look it up and get its translation. However, if possible, I also need to preserve "non-words", i.e. stopwords so that I could replicate them in the output stream without translating. If the latter is not possible then I just need to preserve the order of the original words so that their translations have the same order in the output.

Can I accomplish this using Lucene components? I presume I'd have to start by creating an analyzer for the foreign language, but then what? How do I (i) tokenize, (ii) access words in the correct order, (iii) also access non-words if possible?

Thanks much


Ilya Zavorin



Re: Use multiple lucene indices

Posted by Rui Wang <rw...@ebi.ac.uk>.
Hi Danil,

Thank you for your suggestions.

We will have approximately half a million documents per file, so using your calculation, 20000 files * 500000 = 10,000,000,000. And we are likely to get more files in the future, so a scalable solution is most desirable.

The document IDs are not unique between files, so we will have to filter by file name as well. ehcache is certainly an interesting idea; does it have a load speed comparable to a Lucene index, and what about the memory footprint?

Another thing I should have mentioned before: we will add a few files (say 10) per day. This means we need to update indices on a regular basis, hence the reason we were thinking of generating one index per file.

Am I right to say that you would definitely not go for the one-index-per-file solution? Is that also due to memory consumption?

Many thanks,
Rui Wang




Re: Use multiple lucene indices

Posted by Danil ŢORIN <to...@gmail.com>.
How many documents are there in the system?
Approximate it by: 20000 files * avg(docs/file).

From my understanding, your queries will just be lookups for a document ID
(Q: are those IDs unique between files, or do you need to filter by
filename?). If that will be the only use case, then maybe you should consider
some other lookup systems; an ehcache offloaded and persisted on disk might
work just as well.

If you are anywhere < 200 mln documents, I'd say you should go with a single
index that contains all the data on a decent box (2-4 CPUs, 4-8 GB RAM).
On a slightly beefier host with Lucene 4 (try various codecs for speed/memory
usage), I think you could go to 1 bln documents.

If you plan on more complex queries... like, given a position in a file,
identify the document that contains it... then the number of documents should
be reconsidered.

In the worst-case scenario I would go with a partitioned index (5-10
partitions, but not thousands).
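
If everything does end up in one large index, filtering by filename alongside the ID is just a two-clause BooleanQuery. A quick sketch (Lucene 3.x API; the field names are assumptions):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

public class IdAndFileQuery {

    // Both clauses are MUST, so a hit needs the right ID *and* the right file.
    static BooleanQuery build(String id, String filename) {
        BooleanQuery query = new BooleanQuery();
        query.add(new TermQuery(new Term("id", id)), BooleanClause.Occur.MUST);
        query.add(new TermQuery(new Term("filename", filename)), BooleanClause.Occur.MUST);
        return query;
    }
}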



Re: Use multiple lucene indices

Posted by Rui Wang <rw...@ebi.ac.uk>.
Hi Guys,

Thank you very much for your answers. 

I will do some profiling on memory usage, but is there any documentation on how Lucene uses/allocates memory?

Best wishes,
Rui Wang




Re: Use multiple lucene indices

Posted by KARTHIK SHIVAKUMAR <ns...@gmail.com>.
hi

>> would the memory usage go through the roof?

Yup ....

My past experience got me into a pickle there...



with regards
karthik



-- 
*N.S.KARTHIK
R.M.S.COLONY
BEHIND BANK OF INDIA
R.M.V 2ND STAGE
BANGALORE
560094*