Posted to java-user@lucene.apache.org by Chun Wei Ho <cw...@gmail.com> on 2006/02/13 10:34:58 UTC

Suggesting refine searches with Lucene

Hi,

I am trying to suggest refined searches for my Lucene search. For
example, if a search turned up too many results, it would list a
number of document title subsequences that occurred frequently in the
results of the previous search, as candidates for refining the search.

Does anyone know the right/any approach to implementing this in a
Lucene-based search app?

Thanks.

CW

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Suggesting refine searches with Lucene

Posted by Chris Hostetter <ho...@fucit.org>.

Take a look at the HighFreqTerms sample class in contrib...

http://svn.apache.org/viewcvs.cgi/lucene/java/trunk/contrib/miscellaneous/src/java/org/apache/lucene/misc/HighFreqTerms.java?rev=376393&view=log

...it doesn't meet your goal as is, because it returns a list of terms that
appear frequently in the whole index, not just in the results of a query.

But if you use a HitCollector (or a Filter) to generate a BitSet of all
your results, and modify HighFreqTerms to only count Terms whose
TermDocs contain a document in your BitSet ... then you'll be really
close to what you want.

(I say really close because it will only suggest individual Terms,
not "phrases" in the sense of something that would match a PhraseQuery
containing multiple Terms ... but you could always index word n-grams of
various sizes so that they count as individual Terms)

Depending on the number of Terms in your index, and the number of results
in a typical search, you may be better off storing the term vectors for
each doc, iterating over the matches, and using
TermFreqVector.getTerms(). Actually, that's probably faster in all cases.
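
The counting step both approaches share can be modeled with plain collections. Here a map of term to doc-id postings stands in for the TermEnum/TermDocs iteration; all names are illustrative, not Lucene API:

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ResultTermCounts {
    // For each term, count how many of its postings fall inside the
    // BitSet of result docs, and return the matching terms ordered by
    // that count (highest first). Terms with no hits are dropped.
    public static List<String> topTerms(Map<String, int[]> postings, BitSet results) {
        final Map<String, Integer> counts = new TreeMap<String, Integer>();
        for (Map.Entry<String, int[]> e : postings.entrySet()) {
            int c = 0;
            for (int doc : e.getValue()) {
                if (results.get(doc)) c++;   // only count docs that matched the query
            }
            if (c > 0) counts.put(e.getKey(), c);
        }
        List<String> terms = new ArrayList<String>(counts.keySet());
        terms.sort((a, b) -> counts.get(b) - counts.get(a));
        return terms;
    }
}
```

In the real thing the postings map is replaced by walking IndexReader's term enumeration (or the stored term vectors of just the matching docs), but the intersection-and-count logic is the same.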


: Date: Mon, 13 Feb 2006 17:34:58 +0800
: From: Chun Wei Ho <cw...@gmail.com>
: Reply-To: java-user@lucene.apache.org
: To: java-user@lucene.apache.org
: Subject: Suggesting refine searches with Lucene
:
: Hi,
:
: I am trying to suggest refined searches for my Lucene search. For
: example, if a search turned up too many results, it would list a
: number of document title subsequences that occurred frequently in the
: results of the previous search, as possible candidates for refining
: the search.
:
: Does anyone know the right/any approach to implementing this in a
: Lucene-based search app?
:
: Thanks.
:
: CW



-Hoss




Re: Size + memory restrictions

Posted by Leon Chaddock <le...@macranet.co.uk>.
Hi Chris,
Thanks. When I said segment I meant index file.
So if we have 10 separate index files, are you saying we should have one
IndexSearcher for the index collectively, or one per index file?

Thanks

Leon


----- Original Message ----- 
From: "Chris Hostetter" <ho...@fucit.org>
To: <ja...@lucene.apache.org>
Sent: Wednesday, February 15, 2006 6:40 PM
Subject: Re: Size + memory restrictions


> : We may have many different segments of our index, and it seems below we
> : are using one IndexSearcher per segment. Could this explain why we run
> : out of memory when using more than 2/3 segments?
> : Anyone else have any comments on the below?
>
> terminology is a big issue here .. when you use the word "segment" it
> seems like you are talking about a segment of your data, which is a
> self-contained index in its own right.  My point in the comment you
> quoted was that for a given index, you don't need more than one active
> IndexSearcher open at a time; any more than that can waste resources.
>
> I don't know what kind of memory overhead there is in a MultiSearcher, but
> besides that you should also be looking at the other issues in the message
> you quoted from: who/when is calling your getSearcher() method? ... is
> it getting called more often than the underlying indexes change? who is
> closing the old searchers when you open new ones?


Re: Size + memory restrictions

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Leon,

An index is typically a directory on disk containing files (commonly called "index files").
Each index can have one or more segments.
Each segment is comprised of several index files.

If you are using the compound index format, the situation is a bit different (fewer index files).

Otis
P.S.
You asked about Lucene in Action... :)

----- Original Message ----
From: Chris Hostetter <ho...@fucit.org>
To: java-user@lucene.apache.org
Sent: Wednesday, February 15, 2006 1:40:01 PM
Subject: Re: Size + memory restrictions

: We may have many different segments of our index, and it seems below we
: are using one IndexSearcher per segment. Could this explain why we run
: out of memory when using more than 2/3 segments?
: Anyone else have any comments on the below?

terminology is a big issue here .. when you use the word "segment" it
seems like you are talking about a segment of your data, which is a
self-contained index in its own right.  My point in the comment you
quoted was that for a given index, you don't need more than one active
IndexSearcher open at a time; any more than that can waste resources.

I don't know what kind of memory overhead there is in a MultiSearcher, but
besides that you should also be looking at the other issues in the message
you quoted from: who/when is calling your getSearcher() method? ... is
it getting called more often than the underlying indexes change? who is
closing the old searchers when you open new ones?



-Hoss




Re: Size + memory restrictions

Posted by Chris Hostetter <ho...@fucit.org>.
: We may have many different segments of our index, and it seems below we
: are using one IndexSearcher per segment. Could this explain why we run
: out of memory when using more than 2/3 segments?
: Anyone else have any comments on the below?

terminology is a big issue here .. when you use the word "segment" it
seems like you are talking about a segment of your data, which is a
self-contained index in its own right.  My point in the comment you
quoted was that for a given index, you don't need more than one active
IndexSearcher open at a time; any more than that can waste resources.

I don't know what kind of memory overhead there is in a MultiSearcher, but
besides that you should also be looking at the other issues in the message
you quoted from: who/when is calling your getSearcher() method? ... is
it getting called more often than the underlying indexes change? who is
closing the old searchers when you open new ones?
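
The discipline described above (one shared searcher per index, reopen only when the index changes, close what you replace) can be sketched roughly as follows. The Searcher and SearcherFactory interfaces are stand-ins invented for the sketch, not Lucene classes:

```java
public class SearcherHolder {
    // Minimal stand-ins for the Lucene types; illustrative only.
    public interface Searcher {
        long indexVersion();
        void close();
    }
    public interface SearcherFactory {
        Searcher open();
    }

    private final SearcherFactory factory;
    private Searcher current;

    public SearcherHolder(SearcherFactory factory) {
        this.factory = factory;
    }

    // Hand out one shared searcher; reopen only when the on-disk index
    // has actually changed, and close the old one so it isn't leaked.
    public synchronized Searcher get(long latestIndexVersion) {
        if (current == null || current.indexVersion() < latestIndexVersion) {
            Searcher fresh = factory.open();
            if (current != null) {
                current.close(); // in production, wait for in-flight queries first
            }
            current = fresh;
        }
        return current;
    }
}
```

This answers the three questions in one place: get() is the only caller of open(), it reopens only when the version moves, and the replaced searcher is closed rather than abandoned.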




-Hoss




Re: Size + memory restrictions

Posted by Leon Chaddock <le...@macranet.co.uk>.
Looking into the memory problems further, I read:

"Every time you open an IndexSearcher/IndexReader, resources are used which
take up memory.  For an application pointed at a static index, you only
ever need one IndexReader/IndexSearcher that can be shared among multiple
threads issuing queries.  If your index is being incrementally updated,
you should never need more than two searcher/reader pairs open at a time"

We may have many different segments of our index, and it seems below we
are using one IndexSearcher per segment. Could this explain why we run
out of memory when using more than 2/3 segments?
Anyone else have any comments on the below?

Many thanks

Leon
ps. At the moment I think it is set to only look at 2 segments

private Searcher getSearcher() throws IOException {
  if (mSearcher == null) {
    synchronized (Monitor) {
      int maxI = 2;
      // size the array to the number of searchers actually opened,
      // so MultiSearcher never receives trailing null entries
      Searcher[] srs = new IndexSearcher[Math.min(maxI, SearchersDir.size())];
      int i = 0;
      for (Iterator iter = SearchersDir.iterator(); iter.hasNext() && i < maxI; i++) {
        String dir = (String) iter.next();
        try {
          srs[i] = new IndexSearcher(IndexDir + dir);
        } catch (IOException e) {
          log.error(ClassTool.getClassNameOnly(e) + ": " + e.getMessage(), e);
        }
      }
      mSearcher = new MultiSearcher(srs);
      changeTime = System.currentTimeMillis();
    }
  }
  return mSearcher;
}


Re: Size + memory restrictions

Posted by Leon Chaddock <le...@macranet.co.uk>.
Hi Greg,
Thanks. We are actually running against 4 segments of 4gb each, so about 20
million docs. We can't merge the segments as there seem to be problems with
our linux box with files over about 4gb. Not sure why that is.

If I were to upgrade to 8gb of ram, does it seem likely this will double the
amount of docs we can handle, or would it provide an exponential increase?

Thanks

Leon


Re: Size + memory restrictions

Posted by Greg Gershman <gr...@yahoo.com>.
You may consider incrementally adding documents to
your index; I'm not sure why there would be problems
adding to an existing index, but you can always add
additional documents.  You can optimize later to get
everything back into a single segment.

Querying is a different story; if you are using the
Sort API, you will need enough memory to store a full
sorting of your documents in memory.  If you're trying
to sort on a string or anything other than an int or
float, this could require a lot of memory.
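
As a rough back-of-envelope, with the per-String sizes being assumptions rather than measurements, sorting 20 million docs by a string field can need an order of magnitude more memory than sorting by an int:

```java
public class SortMemoryEstimate {
    public static void main(String[] args) {
        long docs = 20_000_000L;     // roughly the index size discussed here
        long intBytes = docs * 4;    // one 32-bit value per doc for an int sort key
        // a String sort key needs the char data plus per-object overhead;
        // assume ~20 chars (2 bytes each) and ~40 bytes of JVM overhead (rough guess)
        long strBytes = docs * (20 * 2 + 40);
        System.out.println("int sort cache:    ~" + intBytes / (1024 * 1024) + " MB");
        System.out.println("String sort cache: ~" + strBytes / (1024 * 1024) + " MB");
    }
}
```

The exact numbers depend on the JVM and field contents, but the shape of the comparison is why a string sort over a large index can exhaust a 4GB heap that an int sort would not.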

I've used indices much bigger than 5 mil. docs/3.5 gb
with less than 4GB of RAM and had no problems.

Greg




__________________________________________________
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around 
http://mail.yahoo.com 



Size + memory restrictions

Posted by Leon Chaddock <le...@macranet.co.uk>.
Hi,
we are having tremendous problems building a large lucene index and querying 
it.

The programmers are telling me that when the index file reaches 3.5 gb or 5 
million docs the index file can no longer grow any larger.

To rectify this they have built index files in multiple directories. Now 
apparently my 4gb memory is not enough to query.

Does this seem right to people, or does anyone have any experience on
largish scale projects?

I am completely tearing my hair out here and don't know what to do.

Thanks

Leon
----- Original Message ----- 
From: "Chun Wei Ho" <cw...@gmail.com>
To: <ja...@lucene.apache.org>
Sent: Monday, February 13, 2006 10:41 AM
Subject: Re: Suggesting refine searches with Lucene


Thanks. But I am actually looking for approaches/libraries which will
help me to come up with the suggested "refine searches".

For example I might search for "accident" on the headlines at a news
site, which would come back with lots of hits. I am looking for
something that would analyze the headlines (or some other specified
text field) of all those hits and come up with a list of refined
searches that would return a specific/considerable subset of the
results, e.g. "Traffic", "plane", "boating", etc, being frequent
occurrences of headline text in news that include "accident" in the
headlines.

I guess it's a matter of finding frequently occurring subsequences with
some intelligent guessing, but I was hoping that someone better versed in
this would have already done it in a library that I could adapt.

Regards,
CW


On 2/13/06, Ravi <ra...@siti.com> wrote:
> Hi ,
>
>
> I have implemented this using the Query.mergeBooleanQueries() method. In
> this approach I created a POJO class, RefineQuery, which holds a single
> Query field that I set whenever a search is run. The next time, if it is a
> refined search, I merge the current query with the one held by the
> RefineQuery object, pass the resulting query to Lucene, and store it back
> in the RefineQuery object. This is working fine; let me know if you have
> any further ideas or other techniques for implementing refined search
>
>
>
> if (objSearchParameters.isBSearchInSearch()) {
>     Query q2 = Query.mergeBooleanQueries(
>             new Query[] { objRefineQuery.getQuery(), booleanQuery });
>     objRefineQuery.setQuery(q2);
>     hits = searcher.search(q2);
> } else {
>     objRefineQuery.setQuery(booleanQuery);
>     hits = searcher.search(booleanQuery);
> }
>
>
>
>
>
>
>
> public class RefineQuery {
>
>         private Query   query = null;
>
>
>         public Query getQuery() {
>                 return query;
>         }
>
>
>         public void setQuery(Query query) {
>                 this.query = query;
>         }
>
>
>         public String toString(){
>           return query.toString();
>         }
>
> }
>
>
>
>
> Regards,
> Ravi Kumar Jaladanki
>
> -----Original Message-----
> From: Chun Wei Ho [mailto:cwho.work@gmail.com]
> Sent: Monday, February 13, 2006 3:05 PM
> To: java-user@lucene.apache.org
> Subject: Suggesting refine searches with Lucene
>
> Hi,
>
> I am trying to suggest refine searches for my Lucene search. For
> example, if a search turned up too many results, it would list a
> number of document title subsequences that occurred frequently in the
> results of the previous search, as possible candidates for refining
> the search.
>
> Does anyone know the right/any approach to implementing this in a
> Lucene-based search app?
>
> Thanks.
>
> CW
>
>
>
>
>











Re: Suggesting refine searches with Lucene

Posted by Ben <ne...@gmail.com>.
I may be wrong, but isn't this what Carrot2 does?

-Ben

On 2/13/06, Chun Wei Ho <cw...@gmail.com> wrote:
>
> Thanks. But I am actually looking for approaches/libraries which will
> help me to come up with the suggested "refine searches".
>
> For example I might search for "accident" on the headlines at a news
> site, which would come back with lots of hits. I am looking for
> something that would analyze the headlines (or some other specified
> text field) of all those hits and come up with a list of refined
> searches that would return a specific/considerable subset of the
> results, e.g. "Traffic", "plane", "boating", etc, being frequent
> occurrences of headline text in news that include "accident" in the
> headlines.
>
> I guess it's a matter of finding frequently occurring subsequences with
> some intelligent guessing, but I was hoping that someone better versed in
> this would have already done it in a library that I could adapt.
>
> Regards,
> CW
>
>
> On 2/13/06, Ravi <ra...@siti.com> wrote:
> > Hi ,
> >
> >
> > I have implemented this using the Query.mergeBooleanQueries() method.
> > In this approach I created a POJO class, RefineQuery, which holds a
> > single Query field that I set whenever a search is run. The next time,
> > if it is a refined search, I merge the current query with the one held
> > by the RefineQuery object, pass the resulting query to Lucene, and
> > store it back in the RefineQuery object. This is working fine; let me
> > know if you have any further ideas or other techniques for implementing
> > refined search
> >
> >
> >
> > if (objSearchParameters.isBSearchInSearch()) {
> >     Query q2 = Query.mergeBooleanQueries(
> >             new Query[] { objRefineQuery.getQuery(), booleanQuery });
> >     objRefineQuery.setQuery(q2);
> >     hits = searcher.search(q2);
> > } else {
> >     objRefineQuery.setQuery(booleanQuery);
> >     hits = searcher.search(booleanQuery);
> > }
> >
> >
> >
> >
> >
> >
> >
> > public class RefineQuery {
> >
> >         private Query   query = null;
> >
> >
> >         public Query getQuery() {
> >                 return query;
> >         }
> >
> >
> >         public void setQuery(Query query) {
> >                 this.query = query;
> >         }
> >
> >
> >         public String toString(){
> >           return query.toString();
> >         }
> >
> > }
> >
> >
> >
> >
> > Regards,
> > Ravi Kumar Jaladanki
> >
> > -----Original Message-----
> > From: Chun Wei Ho [mailto:cwho.work@gmail.com]
> > Sent: Monday, February 13, 2006 3:05 PM
> > To: java-user@lucene.apache.org
> > Subject: Suggesting refine searches with Lucene
> >
> > Hi,
> >
> > I am trying to suggest refine searches for my Lucene search. For
> > example, if a search turned up too many results, it would list a
> > number of document title subsequences that occurred frequently in the
> > results of the previous search, as possible candidates for refining
> > the search.
> >
> > Does anyone know the right/any approach to implementing this in a
> > Lucene-based search app?
> >
> > Thanks.
> >
> > CW
> >
> >
> >
> >
> >
>
>
>

Re: Suggesting refine searches with Lucene

Posted by Chun Wei Ho <cw...@gmail.com>.
Thanks. But I am actually looking for approaches/libraries which will
help me to come up with the suggested "refine searches".

For example I might search for "accident" on the headlines at a news
site, which would come back with lots of hits. I am looking for
something that would analyze the headlines (or some other specified
text field) of all those hits and come up with a list of refined
searches that would return a specific/considerable subset of the
results, e.g. "Traffic", "plane", "boating", etc, being frequent
occurrences of headline text in news that include "accident" in the
headlines.

I guess it's a matter of finding frequently occurring subsequences with
some intelligent guessing, but I was hoping that someone better versed in
this would have already done it in a library that I could adapt.

Regards,
CW
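The subsequence counting described above can be sketched in plain Java. This is only an illustration with no Lucene dependency, and the class and method names are invented for the example: it counts the document frequency of each word bigram across the result titles and keeps the ones that recur, as candidate refinement phrases.

```java
import java.util.*;

// Hypothetical sketch (not a Lucene API): find word bigrams that recur
// across the titles of a result set, as candidate refinement phrases.
public class RefinementPhrases {

    // Returns bigrams occurring in at least minDocs titles, most frequent first.
    public static List<String> frequentBigrams(List<String> titles, int minDocs) {
        final Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String title : titles) {
            String[] words = title.toLowerCase().split("\\s+");
            // Count each bigram at most once per title (document frequency).
            Set<String> seen = new HashSet<String>();
            for (int i = 0; i + 1 < words.length; i++) {
                seen.add(words[i] + " " + words[i + 1]);
            }
            for (String bigram : seen) {
                Integer c = counts.get(bigram);
                counts.put(bigram, c == null ? 1 : c.intValue() + 1);
            }
        }
        List<String> frequent = new ArrayList<String>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue().intValue() >= minDocs) {
                frequent.add(e.getKey());
            }
        }
        Collections.sort(frequent, new Comparator<String>() {
            public int compare(String a, String b) {
                return counts.get(b).intValue() - counts.get(a).intValue();
            }
        });
        return frequent;
    }
}
```

In a real app the titles would come from Hits or stored term vectors, and you would filter stop words and the original query term before offering the phrases.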


On 2/13/06, Ravi <ra...@siti.com> wrote:
> Hi ,
>
>
> I have implemented this using the Query.mergeBooleanQueries() method. In
> this approach I created a POJO class, RefineQuery, which holds a single
> Query field that I set whenever a search is run. The next time, if it is a
> refined search, I merge the current query with the one held by the
> RefineQuery object, pass the resulting query to Lucene, and store it back
> in the RefineQuery object. This is working fine; let me know if you have
> any further ideas or other techniques for implementing refined search
>
>
>
> if (objSearchParameters.isBSearchInSearch()) {
>     Query q2 = Query.mergeBooleanQueries(
>             new Query[] { objRefineQuery.getQuery(), booleanQuery });
>     objRefineQuery.setQuery(q2);
>     hits = searcher.search(q2);
> } else {
>     objRefineQuery.setQuery(booleanQuery);
>     hits = searcher.search(booleanQuery);
> }
>
>
>
>
>
>
>
> public class RefineQuery {
>
>         private Query   query = null;
>
>
>         public Query getQuery() {
>                 return query;
>         }
>
>
>         public void setQuery(Query query) {
>                 this.query = query;
>         }
>
>
>         public String toString(){
>           return query.toString();
>         }
>
> }
>
>
>
>
> Regards,
> Ravi Kumar Jaladanki
>
> -----Original Message-----
> From: Chun Wei Ho [mailto:cwho.work@gmail.com]
> Sent: Monday, February 13, 2006 3:05 PM
> To: java-user@lucene.apache.org
> Subject: Suggesting refine searches with Lucene
>
> Hi,
>
> I am trying to suggest refine searches for my Lucene search. For
> example, if a search turned up too many results, it would list a
> number of document title subsequences that occurred frequently in the
> results of the previous search, as possible candidates for refining
> the search.
>
> Does anyone know the right/any approach to implementing this in a
> Lucene-based search app?
>
> Thanks.
>
> CW
>
>
>
>
>



RE: Suggesting refine searches with Lucene

Posted by Koji Sekiguchi <ko...@m4.dion.ne.jp>.
I may misunderstand your needs, but isn't it relevance feedback?
Please check Grant Ingersoll's presentation at ApacheCon 2005.
He put out great demo programs for relevance feedback using Lucene.

Thank you,

Koji
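For readers unfamiliar with the term: relevance feedback (e.g. Rocchio's method) re-weights the query using terms from documents the user judged relevant. The following is a rough, self-contained sketch of that idea only, not the demo code mentioned above; the alpha/beta values and the simple term weights are invented for illustration.

```java
import java.util.*;

// Illustrative Rocchio-style expansion: newQuery = alpha * query +
// beta * (average of the relevant document vectors). Term weights are
// plain doubles here; a real system would use tf-idf from the index.
public class RocchioSketch {

    public static Map<String, Double> expandQuery(Map<String, Double> query,
            List<Map<String, Double>> relevantDocs, double alpha, double beta) {
        Map<String, Double> expanded = new HashMap<String, Double>();
        // Scale the original query terms by alpha.
        for (Map.Entry<String, Double> e : query.entrySet()) {
            expanded.put(e.getKey(), alpha * e.getValue().doubleValue());
        }
        // Add beta times the average weight of each term in the relevant docs.
        for (Map<String, Double> doc : relevantDocs) {
            for (Map.Entry<String, Double> e : doc.entrySet()) {
                Double current = expanded.get(e.getKey());
                double contribution =
                        beta * e.getValue().doubleValue() / relevantDocs.size();
                expanded.put(e.getKey(),
                        Double.valueOf((current == null ? 0.0
                                : current.doubleValue()) + contribution));
            }
        }
        return expanded;
    }
}
```

The highest-weighted new terms in the expanded vector are natural candidates to show the user as suggested refinements.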

> -----Original Message-----
> From: Chun Wei Ho [mailto:cwho.work@gmail.com]
> Sent: Monday, February 13, 2006 6:35 PM
> To: java-user@lucene.apache.org
> Subject: Suggesting refine searches with Lucene
> 
> 
> Hi,
> 
> I am trying to suggest refine searches for my Lucene search. For
> example, if a search turned up too many results, it would list a
> number of document title subsequences that occurred frequently in the
> results of the previous search, as possible candidates for refining
> the search.
> 
> Does anyone know the right/any approach to implementing this in a
> Lucene-based search app?
> 
> Thanks.
> 
> CW
> 
> 
> 




Re: Suggesting refine searches with Lucene

Posted by Klaus <kl...@vommond.de>.
A simple approach is to count the most common words in the result set and
present them in combination with the original query. If you have any meta
information, you could use it to refine the query.
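A minimal sketch of that counting step, assuming the result titles are available as strings (plain Java with invented names; in a Lucene app the words would more efficiently come from stored fields or term vectors):

```java
import java.util.*;

// Hypothetical helper: count the most common words in the titles of a
// result set, skipping the query term itself and a few stop words, and
// return the top k as candidate refinements.
public class CommonWords {

    private static final Set<String> STOP =
            new HashSet<String>(Arrays.asList("a", "an", "the", "on", "in", "of"));

    public static List<String> topWords(List<String> titles,
                                        String queryTerm, int k) {
        final Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String title : titles) {
            for (String w : title.toLowerCase().split("\\s+")) {
                if (w.equals(queryTerm) || STOP.contains(w)) {
                    continue; // don't suggest the query term or stop words
                }
                Integer c = counts.get(w);
                counts.put(w, c == null ? 1 : c.intValue() + 1);
            }
        }
        List<String> words = new ArrayList<String>(counts.keySet());
        Collections.sort(words, new Comparator<String>() {
            public int compare(String a, String b) {
                return counts.get(b).intValue() - counts.get(a).intValue();
            }
        });
        return words.subList(0, Math.min(k, words.size()));
    }
}
```

Each suggested word would then be ANDed with the original query to produce the refined search.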

-----Original Message-----
From: Chun Wei Ho [mailto:cwho.work@gmail.com]
Sent: Monday, 13 February 2006 10:35
To: java-user@lucene.apache.org
Subject: Suggesting refine searches with Lucene

Hi,

I am trying to suggest refine searches for my Lucene search. For
example, if a search turned up too many results, it would list a
number of document title subsequences that occurred frequently in the
results of the previous search, as possible candidates for refining
the search.

Does anyone know the right/any approach to implementing this in a
Lucene-based search app?

Thanks.

CW





Re: Suggesting refine searches with Lucene

Posted by Klaus <kl...@vommond.de>.
>And next time if it is a refined search I will merge current query with  

How do you recognize a refined query? And how are the queries refined?

Cheers,

klaus




RE: Suggesting refine searches with Lucene

Posted by Ravi <ra...@siti.com>.
Hi ,


I have implemented this using the Query.mergeBooleanQueries() method. In this
approach I created a POJO class, RefineQuery, which holds a single Query field
that I set whenever a search is run. The next time, if it is a refined search,
I merge the current query with the one held by the RefineQuery object, pass
the resulting query to Lucene, and store it back in the RefineQuery object.
This is working fine; let me know if you have any further ideas or other
techniques for implementing refined search



if (objSearchParameters.isBSearchInSearch()) {
    Query q2 = Query.mergeBooleanQueries(
            new Query[] { objRefineQuery.getQuery(), booleanQuery });
    objRefineQuery.setQuery(q2);
    hits = searcher.search(q2);
} else {
    objRefineQuery.setQuery(booleanQuery);
    hits = searcher.search(booleanQuery);
}







public class RefineQuery {

    private Query query = null;

    public Query getQuery() {
        return query;
    }

    public void setQuery(Query query) {
        this.query = query;
    }

    public String toString() {
        return query.toString();
    }
}
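To make the narrowing behaviour concrete, here is a toy model (not Lucene code; all names are invented): if each query is treated as a set of required terms, merging the previous and current queries unions those terms, so every refinement step can only shrink the result set. This mirrors the effect of merging BooleanQueries whose clauses are all required.

```java
import java.util.*;

// Toy model of refine-within-results: a "query" is a set of required
// terms, and merging two queries unions their required terms, so the
// merged query matches a subset of what either query matched alone.
public class MergeSketch {

    public static Set<String> merge(Set<String> previous, Set<String> current) {
        Set<String> merged = new HashSet<String>(previous);
        merged.addAll(current); // union of required terms
        return merged;
    }

    // A document "matches" if its text contains every required term.
    public static boolean matches(String text, Set<String> requiredTerms) {
        List<String> words = Arrays.asList(text.toLowerCase().split("\\s+"));
        return words.containsAll(requiredTerms);
    }
}
```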




Regards,
Ravi Kumar Jaladanki

-----Original Message-----
From: Chun Wei Ho [mailto:cwho.work@gmail.com] 
Sent: Monday, February 13, 2006 3:05 PM
To: java-user@lucene.apache.org
Subject: Suggesting refine searches with Lucene

Hi,

I am trying to suggest refine searches for my Lucene search. For
example, if a search turned up too many results, it would list a
number of document title subsequences that occurred frequently in the
results of the previous search, as possible candidates for refining
the search.

Does anyone know the right/any approach to implementing this in a
Lucene-based search app?

Thanks.

CW


