Posted to dev@lucene.apache.org by Ali Salehi <al...@epfl.ch> on 2007/01/08 20:33:04 UTC

Lucene Scalability Question

Hello,
 I have a question about the scalability of Lucene.
 I'm a Lucene beginner and I would like to use it to index several
 hundred million measurements (400 million). A measurement has a type,
 owner, id, precision and data.
 As an experiment, I tried to insert 5M values into a compound Lucene
 index with a merge factor of 100,000.
 For searching I have two problems:

 1. The search time for simple queries such as precision:\+0002 is really
 high (4-10 seconds). Is this search time normal given the amount of data
 I inserted into Lucene (5 million values)? If not, how can I improve it?
 I'm sure I could improve it by upgrading my current box (1 GB memory and
 a 3.2 GHz CPU with 2 MB cache), but I'm looking for
 software/configuration solutions.

 2. The search throws a TooManyClauses exception when I'm searching for a
 data item with queries similar to the one below:

 precision:\+0002 AND data:\+0.85*

 I guess this is a bug?!
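(A note for readers of the archive: this is expected behavior rather than a bug. A wildcard query such as data:\+0.85* is rewritten into a boolean query with one clause per matching indexed term, and Lucene caps the clause count at 1024 by default. The sketch below is plain Java with hypothetical names, not Lucene code; it only illustrates why a field with millions of distinct numeric values makes prefix expansion blow past that limit.)

```java
import java.util.ArrayList;
import java.util.List;

public class WildcardExpansion {
    // Lucene's default BooleanQuery clause limit in this era.
    static final int MAX_CLAUSE_COUNT = 1024;

    // Hypothetical stand-in for query rewriting: a wildcard/prefix query
    // expands into one clause per matching term in the term dictionary.
    static List<String> expand(List<String> termDict, String prefix) {
        List<String> clauses = new ArrayList<String>();
        for (String term : termDict) {
            if (term.startsWith(prefix)) {
                clauses.add(term);
                if (clauses.size() > MAX_CLAUSE_COUNT) {
                    // Lucene throws BooleanQuery.TooManyClauses here.
                    throw new RuntimeException("TooManyClauses");
                }
            }
        }
        return clauses;
    }

    public static void main(String[] args) {
        List<String> dict = new ArrayList<String>();
        // With millions of distinct float values, thousands of indexed
        // terms can share the prefix "+0.85".
        for (int i = 0; i < 2000; i++) {
            dict.add(String.format("+0.85%04d", i));
        }
        try {
            expand(dict, "+0.85");
        } catch (RuntimeException e) {
            System.out.println(e.getMessage()); // prints TooManyClauses
        }
    }
}
```

The usual workarounds are to replace the wildcard with a range query/filter over the padded values, or to raise the clause limit (at a memory cost).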

Thanks for your help,
Ali Salehi



**************************************************************
Ali Salehi, LSIR - Distributed Information Systems Laboratory
EPFL-IC-IIF-LSIR, Bâtiment BC, Station 14, CH-1015 Lausanne, Switzerland.
http://lsirwww.epfl.ch/
email: ali.salehi@epfl.ch
Tel: +41-21-6936656 Fax: +41-21-6938115


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org
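An aside on the padded values in the message above (precision:\+0002, and later in the thread +5.43243243440000): Lucene of this vintage compares terms and range endpoints as strings, so numeric fields are typically indexed as fixed-width, zero-padded text so that string order matches numeric order. A minimal plain-Java illustration follows; the encode helper is hypothetical, and handling negative numbers would need a different encoding:

```java
public class PaddedNumbers {
    // Fixed-width, sign-prefixed, zero-padded encoding for non-negative
    // integers, so lexicographic (string) order agrees with numeric order.
    static String encode(int v, int width) {
        if (v < 0) throw new IllegalArgumentException("non-negative only");
        return String.format("+%0" + width + "d", v);
    }

    public static void main(String[] args) {
        // Unpadded strings sort incorrectly: "10" < "2" lexicographically.
        System.out.println("10".compareTo("2") < 0); // prints true
        // Padded strings sort numerically: "+0002" < "+0010".
        System.out.println(encode(2, 4)); // prints +0002
        System.out.println(encode(2, 4).compareTo(encode(10, 4)) < 0); // prints true
    }
}
```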


Re: Lucene Scalability Question

Posted by Steven Rowe <sa...@syr.edu>.
J. Delgado wrote:
> I'm looking to hear new ideas people may have to solve this very hard
> problem.

https://issues.apache.org/jira/browse/LUCENE-724




Re: Lucene Scalability Question

Posted by Grant Ingersoll <gs...@apache.org>.
On Jan 10, 2007, at 3:37 PM, J. Delgado wrote:

> No, Oracle Text does not use Lucene. It has its own proprietary


Someone has contributed code that allows Lucene to be run inside of
Oracle's JVM; this is different from Oracle Text. Search the User/Dev
list for recent posts on Oracle and you'll see what Robert means.


--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org

Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ





Re: Lucene Scalability Question

Posted by robert engels <re...@ix.netcom.com>.
It appears the submitter is working at solving all of these issues:
basically a pluggable index.

You should review his emails on the subject.

On Jan 10, 2007, at 3:03 PM, J. Delgado wrote:

> This sounds very interesting... I'll definitely have a look into it.
> However I have the feeling that, like the use of Oracle Text, this is
> keeping separate the underlying data structures used for evaluating
> full-text and conditions over other data types, which brings up other
> issues when trying to do full-blown mixed queries. Things get worse
> when doing joins and other relational algebra operations.
>
> I'm still wondering if the basic data structures should be revised to
> achieve better performance...
>
> -- Joaquin
>
> 2007/1/10, robert engels <re...@ix.netcom.com>:
>> There is a module in Lucene contrib that changes that! It loads
>> Lucene into the Oracle database (it has a JVM), and allows Lucene
>> syntax to perform full-text searching.
>>
>> On Jan 10, 2007, at 2:37 PM, J. Delgado wrote:
>>
>> > No, Oracle Text does not use Lucene. It has its own proprietary
>> > full-text engine. It represents documents, the inverted index and
>> > relationships in a DB schema and it depends heavily on the SQL layer.
>> > This has some severe limitations though...
>> >
>> > Of course, you can push structured data into full-text based indexes.
>> > We have seen how in Lucene we can represent some structured data types
>> > (e.g. dates, numbers) as fields and perform some type of mixed queries
>> > but the Lucene index, as some of you have pointed out, is not meant
>> > for this and does not scale like a DB would.
>> >
>> > I'm looking to hear new ideas people may have to solve this very
>> > hard problem.
>> >
>> > -- Joaquin
>> >
>> > 2007/1/10, robert engels <re...@ix.netcom.com>:
>> >> I think the contrib 'Oracle Full Text' does this (although in the
>> >> reverse).
>> >>
>> >> It uses Lucene for full text queries (embedded into the db); the
>> >> query analyzer works.
>> >>
>> >> It is really a great piece of software. Too bad it can't be done in a
>> >> standard way so that it would work with all dbs.
>> >>
>> >> I think it may be possible to embed Apache Derby to do
>> >> something like this, although this might be overkill. A simple b-tree
>> >> db might work best.
>> >>
>> >> It would be interesting if the documents could be stored in a btree,
>> >> and a GUID used to access them (since the Lucene docid is constantly
>> >> changing). The only stored field in a Lucene Document would be the
>> >> GUID.
>> >>
>> >> On Jan 10, 2007, at 2:21 PM, J. Delgado wrote:
>> >>
>> >> > This is a more general question:
>> >> >
>> >> > Given the fact that most applications require querying a combination
>> >> > of full-text and structured data, has anyone looked into building data
>> >> > structures at the most fundamental level (e.g. a combination of b-tree
>> >> > and inverted lists) that would enable scalable and performant
>> >> > structured (e.g. SQL or XQuery) + full-text queries?
>> >> >
>> >> > Can Lucene be taken as a basis for this, or do you recommend exploring
>> >> > other routes?
>> >> >
>> >> > -- Joaquin
>> >> >
>> >> > 2007/1/10, Chris Hostetter <ho...@fucit.org>:
>> >> >>
>> >> >> : So you mean lucene can't do better than this ?
>> >> >>
>> >> >> robert's point is that based on what you've told us, there is no
>> >> >> reason to think Lucene makes sense for you -- if *all* you are doing
>> >> >> is finding documents based on numeric ranges, then a relational
>> >> >> database is better suited to your task.  if you actually care about
>> >> >> the textual IR features of Lucene, then there are probably ways to
>> >> >> make your searches faster, but you aren't giving us enough
>> >> >> information.
>> >> >>
>> >> >> you said the example code you gave was in a loop ... but a loop over
>> >> >> what? ... what changes with each iteration of the loop? ... if there
>> >> >> are RangeFilters that get reused more than once, CachingWrapperFilter
>> >> >> can come in handy to ensure that work isn't done more often than it
>> >> >> needs to be.
>> >> >>
>> >> >> it's also not clear whether your query on "type:0" is just a
>> >> >> placeholder, or indicative of what you actually want to do in the
>> >> >> long run ... if all of your queries are this simple, and all you care
>> >> >> about is getting a count of things that have type:0 and are in your
>> >> >> numeric ranges, then don't use the "search" method at all, just put
>> >> >> "type:0" in your ChainedFilter and call the "bits" method directly.
>> >> >>
>> >> >> you also haven't given us any information about whether or not you
>> >> >> are opening a new IndexSearcher/IndexReader every time you execute a
>> >> >> query, or reusing the same instance -- reuse makes the performance
>> >> >> much better because it can reuse underlying resources.
>> >> >>
>> >> >> In short: if you state some performance numbers from timing some
>> >> >> code, and want to know how to make that code faster, you have to
>> >> >> actually show people *all* of the code for them to be able to help
>> >> >> you.
>> >> >>
>> >> >>
>> >> >> : >>  I still have the search problem I had before; now search takes
>> >> >> : >> around 750 msecs for a small set of documents.
>> >> >> : >>
>> >> >> : >>     [java] Total Query Processing time (msec) : 38745
>> >> >> : >>     [java] Total No. of Documents : 7,500,000
>> >> >> : >>     [java] Total No. of Executed queries : 50.0
>> >> >> : >>     [java] Execution time per query : 774.9 msec
>> >> >> : >>
>> >> >> : >>  The index is optimized and its size is 830 MB.
>> >> >> : >>  Each document has the following terms:
>> >> >> : >>     VSID(integer), data(float), type(short int), precision(byte).
>> >> >> : >>  The queries are generated in a loop similar to the one below:
>> >> >> : >> loop ...
>> >> >> : >>     RangeFilter rq1 = new RangeFilter("data",
>> >> >> : >>         "+5.43243243440000", "+5.43243243449999", true, true);
>> >> >> : >>     RangeFilter rq2 = new RangeFilter("precision",
>> >> >> : >>         "+0001", "+0002", true, true);
>> >> >> : >>     ChainedFilter cf = new ChainedFilter(
>> >> >> : >>         new Filter[]{rq2, rq1}, ChainedFilter.AND);
>> >> >> : >>     Query query = qp.parse("type:0");
>> >> >> : >>     Hits hits = searcher.search(query, cf);
>> >> >> : >> end loop
>> >> >> : >>
>> >> >> : >>  I would like to know if there exists any solution to improve the
>> >> >> : >> search time? (I need to insert more than 500 million of these
>> >> >> : >> data pages into Lucene.)
>> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >> >> -Hoss
>> >> >>
>> >> >>
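The advice in the quoted reply above about reusing one IndexSearcher/IndexReader rather than reopening one per query can be sketched generically. The names below are stand-ins, not the Lucene API; the point is that the expensive object is opened once, lazily, and shared across all queries:

```java
public class SharedSearcher {
    // Hypothetical stand-in for an expensive-to-open IndexSearcher.
    static class Searcher {
        static int opened = 0;
        Searcher() { opened++; }                   // count expensive opens
        int search(String query) { return query.length(); } // dummy search
    }

    private static volatile Searcher instance;

    // Open the searcher once, lazily, and reuse it for every query.
    static Searcher get() {
        if (instance == null) {
            synchronized (SharedSearcher.class) {
                if (instance == null) {
                    instance = new Searcher();
                }
            }
        }
        return instance;
    }

    public static void main(String[] args) {
        for (int i = 0; i < 50; i++) {
            get().search("type:0");
        }
        // All 50 queries shared one underlying searcher.
        System.out.println(Searcher.opened); // prints 1
    }
}
```

With a real IndexSearcher the saving is large because the reader's term index and norms are loaded once instead of per query.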


Re: Lucene Scalability Question

Posted by "J. Delgado" <jo...@gmail.com>.
This sounds very interesting... I'll definitely have a look into it.
However I have the feeling that, like the use of Oracle Text, this is
keeping separate the underlying data structures used for evaluating
full-text and conditions over other data types, which brings up other
issues when trying to do full-blown mixed queries. Things get worse
when doing joins and other relational algebra operations.

I'm still wondering if the basic data structures should be revised to
achieve better performance...

-- Joaquin

2007/1/10, robert engels <re...@ix.netcom.com>:
> There is a module in Lucene contrib that changes that! It loads
> Lucene into the Oracle database (it has a JVM), and allows Lucene
> syntax to perform full-text searching.


Re: Lucene Scalability Question

Posted by robert engels <re...@ix.netcom.com>.
There is a module in Lucene contrib that changes that! It loads  
Lucene into the Oracle database (it has a JVM), and allows Lucene  
syntax to perform full-text searching.

On Jan 10, 2007, at 2:37 PM, J. Delgado wrote:

> No, Oracle Text does not use Lucene. It has its own proprietary
> full-text engine. It represents documents, the inverted index and
> relationships in a DB schema and it depends heavily on the SQL layer.
> This has some severe limitations though...


Re: Lucene Scalability Question

Posted by "J. Delgado" <jo...@gmail.com>.
No, Oracle Text does not use Lucene. It has its own proprietary
full-text engine. It represents documents, the inverted index and
relationships in a DB schema and it depends heavily on the SQL layer.
This has some severe limitations though...

Of course, you can push structured data into full-text based indexes.
We have seen how in Lucene we can represent some structured data types
(e.g. dates, numbers) as fields and perform some type of mixed queries
but the Lucene index, as some of you have pointed out, is not meant
for this and does not scale like a DB would.

I'm looking to hear new ideas people may have to solve this very hard problem.

-- Joaquin

2007/1/10, robert engels <re...@ix.netcom.com>:
> I think the contrib 'Oracle Full Text' does this (although in the
> reverse).


Re: Lucene Scalability Question

Posted by robert engels <re...@ix.netcom.com>.
I think the contrib 'Oracle Full Text' does this (although in the
reverse).

It uses Lucene for full text queries (embedded into the db), and the
query analyzer works.

It is really a great piece of software. Too bad it can't be done in a
standard way so that it would work with all dbs.

I think it may be possible to embed Apache Derby to do
something like this, although this might be overkill. A simple b-tree
db might work best.

It would be interesting if the documents could be stored in a btree,
and a GUID used to access them (since the lucene docid is constantly
changing). The only stored field in a lucene Document would be the GUID.
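A minimal sketch of that idea in plain Java, with a TreeMap standing in for the b-tree (the class and method names are illustrative, not an existing API):

```java
import java.util.TreeMap;
import java.util.UUID;

// Sketch of the GUID idea above: full documents live in an ordered
// store -- a TreeMap standing in for a real b-tree -- keyed by a GUID.
// The Lucene Document would then store only the GUID, which stays
// stable even as Lucene's internal docids shift across merges.
class GuidStore {
    private final TreeMap<String, String> store = new TreeMap<String, String>();

    // Store a document body, returning the stable GUID to index in Lucene.
    String put(String document) {
        String guid = UUID.randomUUID().toString();
        store.put(guid, document);
        return guid;
    }

    // Resolve a GUID from a Lucene hit back to the full document.
    String get(String guid) {
        return store.get(guid);
    }
}
```

A Lucene hit would then yield the GUID as its only stored field, and the full record would be fetched from the store.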

On Jan 10, 2007, at 2:21 PM, J. Delgado wrote:

> This is a more general question:
>
> Given the fact that most applications require querying a combination
> of full-text and structured data has anyone looked into building data
> structures at the most fundamental level  (e.g. combination of b-tree
> and inverted lists) that would enable scalable and performant
> structured (e.g.SQL or XQuery) + Full-Text queries?
>
> Can Lucene be taken as basis for this or do you recommend exploring
> other routes?
>
> -- Joaquin
>
> 2007/1/10, Chris Hostetter <ho...@fucit.org>:
>>
>> : So you mean lucene can't do better than this ?
>>
> >> robert's point is that based on what you've told us, there is no
> >> reason to think Lucene makes sense for you -- if *all* you are doing
> >> is finding documents based on numeric ranges, then a relational
> >> database is better suited to your task.  if you actually care about
> >> the textual IR features of Lucene, then there are probably ways to
> >> make your searches faster, but you aren't giving us enough information.
> >>
> >> you said the example code you gave was in a loop ... but a loop over
> >> what? .. what changes with each iteration of the loop? ... if there
> >> are RangeFilter's that get reused more than once, CachingWrapperFilter
> >> can come in handy to ensure that work isn't done more often than it
> >> needs to be.
> >>
> >> it's also not clear whether your query on "type:0" is just a
> >> placeholder, or indicative of what you actually want to do in the
> >> long run ... if all of your queries are this simple, and all you care
> >> about is getting a count of things that have type:0 and are in your
> >> numeric ranges, then don't use the "search" method at all, just put
> >> "type:0" in your ChainedFilter and call the "bits" method directly.
> >>
> >> you also haven't given us any information about whether or not you
> >> are opening a new IndexSearcher/IndexReader every time you execute a
> >> query, or reusing the same instance -- reuse makes the performance
> >> much better because it can reuse underlying resources.
> >>
> >> In short: if you state some performance numbers from timing some
> >> code, and want to know how to make that code faster, you have to
> >> actually show people *all* of the code for them to be able to help you.
>>
>>
>> : >>  I still have the search problem I had before, now search  
>> takes around
>> : >> 750
>> : >> msecs for a small set of documents.
>> : >>
>> : >>     [java] Total Query Processing time (msec) : 38745
>> : >>     [java] Total No. of Documents : 7,500,000
>> : >>     [java] Total No. of Executed queries : 50.0
>> : >>     [java] Execution time per query : 774.9 msec
>> : >>
>> : >>  The index is optimized and its size is 830 MB.
>> : >>  Each document has the following terms :
>> : >>     VSID(integer), data(float), type(short int) , precision  
>> (byte).
> >> : >>   The queries are generated in a loop similar to the one below:
> >> : >> loop ...
> >> : >>     RangeFilter rq1 = new
> >> : >> RangeFilter
> >> ("data","+5.43243243440000","+5.43243243449999",true,true);
>> : >>     RangeFilter rq2 = new RangeFilter
>> : >> ("precision","+0001","+0002",true,true);
>> : >>     ChainedFilter cf = new ChainedFilter(new
>> : >> Filter[]{rq2,rq1},ChainedFilter.AND);
>> : >>     Query query = qp.parse("type:0");
>> : >>     Hits hits = searcher.search(query,cf);
>> : >> end loop
>> : >>
> >> : >>  I would like to know if there exists any solution to improve
>> the search
>> : >> time ?  (I need to insert more than 500 million of these data  
>> pages into
>> : >> lucene)
>>
>>
>>
>>
>> -Hoss
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Lucene Scalability Question

Posted by "J. Delgado" <jo...@gmail.com>.
This is a more general question:

Given the fact that most applications require querying a combination
of full-text and structured data, has anyone looked into building data
structures at the most fundamental level (e.g. a combination of b-tree
and inverted lists) that would enable scalable and performant
structured (e.g. SQL or XQuery) + Full-Text queries?

Can Lucene be taken as basis for this or do you recommend exploring
other routes?

-- Joaquin

2007/1/10, Chris Hostetter <ho...@fucit.org>:
>
> : So you mean lucene can't do better than this ?
>
> robert's point is that based on what you've told us, there is no reason to
> think Lucene makes sense for you -- if *all* you are doing is finding
> documents based on numeric ranges, then a relational database is better
> suited to your task.  if you actually care about the textual IR features
> of Lucene, then there are probably ways to make your searches faster, but
> you aren't giving us enough information.
>
> you said the example code you gave was in a loop ... but a loop over what?
> .. what changes with each iteration of the loop? ... if there are
> RangeFilter's that get reused more than once, CachingWrapperFilter can come
> in handy to ensure that work isn't done more often than it needs to be.
>
> it's also not clear whether your query on "type:0" is just a placeholder,
> or indicative of what you actually want to do in the long run ... if all
> of your queries are this simple, and all you care about is getting a count
> of things that have type:0 and are in your numeric ranges, then don't use
> the "search" method at all, just put "type:0" in your ChainedFilter and
> call the "bits" method directly.
>
> you also haven't given us any information about whether or not you are
> opening a new IndexSearcher/IndexReader every time you execute a query, or
> reusing the same instance -- reuse makes the performance much better
> because it can reuse underlying resources.
>
> In short: if you state some performance numbers from timing some code, and
> want to know how to make that code faster, you have to actually show people
> *all* of the code for them to be able to help you.
>
>
> : >>  I still have the search problem I had before, now search takes around
> : >> 750
> : >> msecs for a small set of documents.
> : >>
> : >>     [java] Total Query Processing time (msec) : 38745
> : >>     [java] Total No. of Documents : 7,500,000
> : >>     [java] Total No. of Executed queries : 50.0
> : >>     [java] Execution time per query : 774.9 msec
> : >>
> : >>  The index is optimized and its size is 830 MB.
> : >>  Each document has the following terms :
> : >>     VSID(integer), data(float), type(short int) , precision (byte).
> : >>   The queries are generated in a loop similar to the one below:
> : >> loop ...
> : >>     RangeFilter rq1 = new
> : >> RangeFilter("data","+5.43243243440000","+5.43243243449999",true,true);
> : >>     RangeFilter rq2 = new RangeFilter
> : >> ("precision","+0001","+0002",true,true);
> : >>     ChainedFilter cf = new ChainedFilter(new
> : >> Filter[]{rq2,rq1},ChainedFilter.AND);
> : >>     Query query = qp.parse("type:0");
> : >>     Hits hits = searcher.search(query,cf);
> : >> end loop
> : >>
> : >>  I would like to know if there exists any solution to improve the search
> : >> time ?  (I need to insert more than 500 million of these data pages into
> : >> lucene)
>
>
>
>
> -Hoss
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Lucene Scalability Question

Posted by Chris Hostetter <ho...@fucit.org>.
: So you mean lucene can't do better than this ?

robert's point is that based on what you've told us, there is no reason to
think Lucene makes sense for you -- if *all* you are doing is finding
documents based on numeric ranges, then a relational database is better
suited to your task.  if you actually care about the textual IR features
of Lucene, then there are probably ways to make your searches faster, but
you aren't giving us enough information.

you said the example code you gave was in a loop ... but a loop over what?
.. what changes with each iteration of the loop? ... if there are
RangeFilter's that get reused more than once, CachingWrapperFilter can come
in handy to ensure that work isn't done more often than it needs to be.

it's also not clear whether your query on "type:0" is just a placeholder,
or indicative of what you actually want to do in the long run ... if all
of your queries are this simple, and all you care about is getting a count
of things that have type:0 and are in your numeric ranges, then don't use
the "search" method at all, just put "type:0" in your ChainedFilter and
call the "bits" method directly.

you also haven't given us any information about whether or not you are
opening a new IndexSearcher/IndexReader every time you execute a query, or
reusing the same instance -- reuse makes the performance much better
because it can reuse underlying resources.

In short: if you state some performance numbers from timing some code, and
want to know how to make that code faster, you have to actually show people
*all* of the code for them to be able to help you.
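For illustration, the caching point can be sketched without Lucene at all. This is a toy stand-in for what CachingWrapperFilter does, not Lucene's actual class (the names here are made up): the expensive step of building the filter's BitSet runs once per distinct filter, and repeats hit the cache.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Toy stand-in for the CachingWrapperFilter idea (not Lucene's actual
// class): the expensive step -- building the BitSet of matching docs --
// runs once per distinct filter, and repeats are served from a cache.
class CachingRangeFilter {
    private final Map<String, BitSet> cache = new HashMap<String, BitSet>();
    int computeCalls = 0; // exposed so the saving is observable

    BitSet bits(String field, String lower, String upper, int maxDoc) {
        String key = field + "[" + lower + " TO " + upper + "]";
        BitSet cached = cache.get(key);
        if (cached != null) return cached;   // cache hit: no work done
        computeCalls++;                      // cache miss: do the walk once
        BitSet bits = new BitSet(maxDoc);    // stand-in for a real term walk
        cache.put(key, bits);
        return bits;
    }
}
```

Note the caching only helps if you keep reusing the same IndexSearcher/IndexReader, since cached bits are only valid against one reader's doc numbering.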


: >>  I still have the search problem I had before, now search takes around
: >> 750
: >> msecs for a small set of documents.
: >>
: >>     [java] Total Query Processing time (msec) : 38745
: >>     [java] Total No. of Documents : 7,500,000
: >>     [java] Total No. of Executed queries : 50.0
: >>     [java] Execution time per query : 774.9 msec
: >>
: >>  The index is optimized and its size is 830 MB.
: >>  Each document has the following terms :
: >>     VSID(integer), data(float), type(short int) , precision (byte).
: >>   The queries are generated in a loop similar to the one below:
: >> loop ...
: >>     RangeFilter rq1 = new
: >> RangeFilter("data","+5.43243243440000","+5.43243243449999",true,true);
: >>     RangeFilter rq2 = new RangeFilter
: >> ("precision","+0001","+0002",true,true);
: >>     ChainedFilter cf = new ChainedFilter(new
: >> Filter[]{rq2,rq1},ChainedFilter.AND);
: >>     Query query = qp.parse("type:0");
: >>     Hits hits = searcher.search(query,cf);
: >> end loop
: >>
: >>  I would like to know if there exists any solution to improve the search
: >> time ?  (I need to insert more than 500 million of these data pages into
: >> lucene)




-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Lucene Scalability Question

Posted by Ali Salehi <al...@epfl.ch>.
So you mean Lucene can't do better than this?

Best,

On Wed, 10 Jan 2007 20:53:33 +0100, robert engels <re...@ix.netcom.com>  
wrote:

> I think you need a database, not Lucene - especially since you are not  
> even using any text !
>
> On Jan 10, 2007, at 1:39 PM, Ali Salehi wrote:
>
>> Hi,
>>  Thanks for your previous mail.
>>  Now I changed the configuration to use merging factor 50. I also  
>> disabled
>> the compound file parameter.
>>  I still have the search problem I had before, now search takes around  
>> 750
>> msecs for a small set of documents.
>>
>>     [java] Total Query Processing time (msec) : 38745
>>     [java] Total No. of Documents : 7,500,000
>>     [java] Total No. of Executed queries : 50.0
>>     [java] Execution time per query : 774.9 msec
>>
>>  The index is optimized and its size is 830 MB.
>>  Each document has the following terms :
>>     VSID(integer), data(float), type(short int) , precision (byte).
>>   The queries are generated in a loop similar to the one below:
>> loop ...
>>     RangeFilter rq1 = new
>> RangeFilter("data","+5.43243243440000","+5.43243243449999",true,true);
>>     RangeFilter rq2 = new RangeFilter
>> ("precision","+0001","+0002",true,true);
>>     ChainedFilter cf = new ChainedFilter(new
>> Filter[]{rq2,rq1},ChainedFilter.AND);
>>     Query query = qp.parse("type:0");
>>     Hits hits = searcher.search(query,cf);
>> end loop
>>
>>  I would like to know if there exists any solution to improve the search
>> time ?  (I need to insert more than 500 million of these data pages into
>> lucene)
>>
>> Thanks,
>> AliS
>>
>>
>>
>>> On Monday 08 January 2007 20:33, Ali Salehi wrote:
>>>
>>>> 1. The search time for simple queries such as precision:\+0002 is
>>>> really high (4-10 seconds). I want to know if this search time is normal
>>>
>>>> 2. The search gives TooManyClauses exception when I'm searching for a
>>>> data item with the queries similar to the one below :
>>>
>>> Please see the FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ:
>>> Why am I getting a TooManyClauses exception?
>>> How do I speed up searching?
>>>
>>> If that doesn't help, please re-post your question on the user list.
>>>
>>> Regards
>>>  Daniel
>>>
>>> --
>>> http://www.danielnaber.de
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>>
>>
>>
>> **************************************************************
>> Ali Salehi, LSIR - Distributed Information Systems Laboratory
>> EPFL-IC-IIF-LSIR, Bâtiment BC, Station 14, CH-1015 Lausanne,  
>> Switzerland.
>> http://lsirwww.epfl.ch/
>> email: ali.salehi@epfl.ch
>> Tel: +41-21-6936656 Fax: +41-21-6938115
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org



-- 
**************************************************************
Ali Salehi, LSIR - Distributed Information Systems Laboratory
EPFL-IC-IIF-LSIR, Bâtiment BC, Station 14, CH-1015 Lausanne, Switzerland.
http://lsirwww.epfl.ch/
email: ali.salehi@epfl.ch
Tel: +41-21-6936656 Fax: +41-21-6938115

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Lucene Scalability Question

Posted by robert engels <re...@ix.netcom.com>.
I think you need a database, not Lucene - especially since you are
not even using any text!

On Jan 10, 2007, at 1:39 PM, Ali Salehi wrote:

> Hi,
>  Thanks for your previous mail.
>  Now I changed the configuration to use merging factor 50. I also  
> disabled
> the compound file parameter.
>  I still have the search problem I had before, now search takes  
> around 750
> msecs for a small set of documents.
>
>     [java] Total Query Processing time (msec) : 38745
>     [java] Total No. of Documents : 7,500,000
>     [java] Total No. of Executed queries : 50.0
>     [java] Execution time per query : 774.9 msec
>
>  The index is optimized and its size is 830 MB.
>  Each document has the following terms :
>     VSID(integer), data(float), type(short int) , precision (byte).
>   The queries are generated in a loop similar to the one below:
> loop ...
>     RangeFilter rq1 = new
> RangeFilter("data","+5.43243243440000","+5.43243243449999",true,true);
>     RangeFilter rq2 = new RangeFilter
> ("precision","+0001","+0002",true,true);
>     ChainedFilter cf = new ChainedFilter(new
> Filter[]{rq2,rq1},ChainedFilter.AND);
>     Query query = qp.parse("type:0");
>     Hits hits = searcher.search(query,cf);
> end loop
>
>  I would like to know if there exists any solution to improve the
> search
> time ?  (I need to insert more than 500 million of these data pages  
> into
> lucene)
>
> Thanks,
> AliS
>
>
>
>> On Monday 08 January 2007 20:33, Ali Salehi wrote:
>>
>>> 1. The search time for simple queries such as precision:\+0002 is
>>> really high (4-10 seconds). I want to know if this search time is normal
>>
>>> 2. The search gives TooManyClauses exception when I'm searching for a
>>> data item with the queries similar to the one below :
>>
>> Please see the FAQ at http://wiki.apache.org/jakarta-lucene/ 
>> LuceneFAQ:
>> Why am I getting a TooManyClauses exception?
>> How do I speed up searching?
>>
>> If that doesn't help, please re-post your question on the user list.
>>
>> Regards
>>  Daniel
>>
>> --
>> http://www.danielnaber.de
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>
>
>
> **************************************************************
> Ali Salehi, LSIR - Distributed Information Systems Laboratory
> EPFL-IC-IIF-LSIR, Bâtiment BC, Station 14, CH-1015 Lausanne,  
> Switzerland.
> http://lsirwww.epfl.ch/
> email: ali.salehi@epfl.ch
> Tel: +41-21-6936656 Fax: +41-21-6938115
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Lucene Scalability Question

Posted by Ali Salehi <al...@epfl.ch>.
Hi,
 Thanks for your previous mail.
 Now I changed the configuration to use merging factor 50. I also disabled
the compound file parameter.
 I still have the search problem I had before, now search takes around 750
msecs for a small set of documents.

    [java] Total Query Processing time (msec) : 38745
    [java] Total No. of Documents : 7,500,000
    [java] Total No. of Executed queries : 50.0
    [java] Execution time per query : 774.9 msec

 The index is optimized and its size is 830 MB.
 Each document has the following terms :
    VSID(integer), data(float), type(short int) , precision (byte).
  The queries are generated in a loop similar to the one below:
loop ...
    RangeFilter rq1 = new
RangeFilter("data","+5.43243243440000","+5.43243243449999",true,true);
    RangeFilter rq2 = new RangeFilter("precision","+0001","+0002",true,true);
    ChainedFilter cf = new ChainedFilter(new
Filter[]{rq2,rq1},ChainedFilter.AND);
    Query query = qp.parse("type:0");
    Hits hits = searcher.search(query,cf);
end loop

  I would like to know if there exists any solution to improve the search
time?  (I need to insert more than 500 million of these data pages into
lucene)

Thanks,
AliS
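For what it's worth, range filters like the ones above only behave correctly because the values are encoded as fixed-width, sign-prefixed strings, so that lexicographic order matches numeric order. A minimal sketch of that encoding for non-negative values (the class name and width are illustrative, not from the post):

```java
// The fixed-width, "+"-prefixed values used with RangeFilter above
// ("+0001", "+5.4324...") sort correctly as strings only because they
// are zero-padded to a constant width. A minimal encoder for
// non-negative integers:
class PadEncode {
    static String encode(long value, int width) {
        if (value < 0) throw new IllegalArgumentException("non-negative values only");
        // e.g. encode(2, 4) -> "+0002"
        return String.format("+%0" + width + "d", value);
    }
}
```

Without the padding, "+2" would sort after "+1000" lexicographically and the range filter would silently return wrong results.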



> On Monday 08 January 2007 20:33, Ali Salehi wrote:
>
>> 1. The search time for simple queries such as precision:\+0002 is
>> really high (4-10 seconds). I want to know if this search time is normal
>
>> 2. The search gives TooManyClauses exception when I'm searching for a
>> data item with the queries similar to the one below :
>
> Please see the FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ:
> Why am I getting a TooManyClauses exception?
> How do I speed up searching?
>
> If that doesn't help, please re-post your question on the user list.
>
> Regards
>  Daniel
>
> --
> http://www.danielnaber.de
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>


**************************************************************
Ali Salehi, LSIR - Distributed Information Systems Laboratory
EPFL-IC-IIF-LSIR, Bâtiment BC, Station 14, CH-1015 Lausanne, Switzerland.
http://lsirwww.epfl.ch/
email: ali.salehi@epfl.ch
Tel: +41-21-6936656 Fax: +41-21-6938115


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Lucene Scalability Question

Posted by Daniel Naber <lu...@danielnaber.de>.
On Monday 08 January 2007 20:33, Ali Salehi wrote:

>  1. The search time for simple queries such as precision:\+0002 is
> really high (4-10 seconds). I want to know if this search time is normal

>  2. The search gives TooManyClauses exception when I'm searching for a
>  data item with the queries similar to the one below :

Please see the FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ:
Why am I getting a TooManyClauses exception?
How do I speed up searching?

If that doesn't help, please re-post your question on the user list.

Regards
 Daniel

-- 
http://www.danielnaber.de

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Lucene Scalability Question

Posted by Mark Miller <ma...@gmail.com>.
Holy Rabbit Batman! Take that merge factor down. 100,000 is much too high. I
think I have seen that anything above 90 won't help you much. 100,000 is just
insane though... try like 50 and call me in the morning.
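To see why the huge merge factor hurts, here is a toy model of Lucene's logarithmic merge policy (an approximation for illustration, not Lucene's actual code): each flush adds one level-0 segment, and whenever mergeFactor segments pile up at a level they merge into one segment at the next level.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of logarithmic merging: each flush creates one level-0
// segment; once `mergeFactor` segments accumulate at a level, they
// merge into a single segment at the next level up. The total segment
// count is what every search has to pay for.
class MergeModel {
    static int segmentsAfter(int flushes, int mergeFactor) {
        List<Integer> perLevel = new ArrayList<Integer>();
        for (int f = 0; f < flushes; f++) {
            if (perLevel.isEmpty()) perLevel.add(0);
            perLevel.set(0, perLevel.get(0) + 1);        // new flushed segment
            int level = 0;
            while (perLevel.get(level) == mergeFactor) { // cascade merges
                perLevel.set(level, 0);                  // merge them away...
                if (perLevel.size() == level + 1) perLevel.add(0);
                perLevel.set(level + 1, perLevel.get(level + 1) + 1); // ...into one bigger segment
                level++;
            }
        }
        int total = 0;
        for (int c : perLevel) total += c;
        return total;
    }
}
```

With 5M docs flushed 1,000 at a time (5,000 flushes), a merge factor of 100,000 never triggers a single merge, leaving 5,000 tiny segments for every search to consult; a merge factor of 50 leaves only a couple.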

- Mark

On 1/8/07, Ali Salehi <al...@epfl.ch> wrote:
>
> Hello,
> I have a question about the scalability of the Lucene.
> I'm a lucene beginner and I would like to use it to index several
> million measurements (400 Millions). A measurement has a type,
> owner, id, precision and data.
> As an experiment, I tried to insert 5M values into a lucene index using
> compound index with merge factor 100,000.
> For searching I have two problems :
>
> 1. The search time for simple queries such as precision:\+0002 is really
>   high (4-10 seconds). I want to know if this search time is normal
> considering the amount of data I inserted to the lucene (5 Million
> values)?
> If not, how can I improve it. I'm sure I can improve it by upgrading
> my current box (1G memory and 3.2 Ghz CPU with 2 MB cache).
> I'm looking for software/configuration solutions ?
>
> 2. The search gives TooManyClauses exception when I'm searching for a
> data item with the queries similar to the one below :
>
> precision:\+0002 AND data:\+0.85*
>
> I guess this is a bug?!
>
> Thanks for your help,
> Ali Salehi
>
>
>
> **************************************************************
> Ali Salehi, LSIR - Distributed Information Systems Laboratory
> EPFL-IC-IIF-LSIR, Bâtiment BC, Station 14, CH-1015 Lausanne, Switzerland.
> http://lsirwww.epfl.ch/
> email: ali.salehi@epfl.ch
> Tel: +41-21-6936656 Fax: +41-21-6938115
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>