Posted to java-user@lucene.apache.org by Harini Raghavan <ha...@insideview.com> on 2006/01/02 08:26:57 UTC

Re: Query Scoring

Yes, I was referring to how IDF is used in the Highlighter code to decide 
how to prioritize fragments of the documents.
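
To make the role of IDF in fragment selection concrete, here is a minimal sketch in plain Java. It is not the Highlighter's actual code. It assumes the classic Lucene-style formula idf = ln(numDocs / (docFreq + 1)) + 1 and scores a fragment by summing the weights of the distinct query terms it contains. The names IdfFragmentScorer, idf and score are hypothetical.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration; this is not part of the Lucene Highlighter.
class IdfFragmentScorer {

    // Assumed classic Lucene-style IDF: rarer terms get larger weights.
    static double idf(int docFreq, int numDocs) {
        return Math.log((double) numDocs / (docFreq + 1)) + 1.0;
    }

    // Sums the weight of each distinct query term that occurs in the fragment.
    static double score(String[] fragmentTokens, Map<String, Double> termWeights) {
        Set<String> seen = new HashSet<>();
        double total = 0.0;
        for (String token : fragmentTokens) {
            Double weight = termWeights.get(token);
            if (weight != null && seen.add(token)) {
                total += weight; // count each matching term once per fragment
            }
        }
        return total;
    }
}
```

Under this formula, with 1000 documents a term found in 9 of them weighs ln(100) + 1 (about 5.6), while a term found in 499 of them weighs ln(2) + 1 (about 1.7), so fragments containing the rarer term are preferred.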

My requirement is to show the relevant fragments of the news article for 
each company along with the search results. But the Highlighter API 
sometimes picks fragments that are not very relevant to the news 
article/company. I would like to know if there is any way I can 
modify the scoring/ranking of these fragments so that a news item with 
the company name and keywords in the headline is assigned a very strong 
relevancy ranking, closely followed by a company name mention in the 
first paragraph, and then multiple mentions within the entire story. 
Something like headline = 5 points, first paragraph = 4, etc.
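
The point scheme described above can be sketched in plain Java as follows. This is only an illustration of the idea, not Highlighter code. The headline and first-paragraph values come from the description; the value for the rest of the story is an assumption, since only "etc." is given. The names ZoneScorer and Zone are hypothetical.

```java
import java.util.EnumMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration of zone-based points; not a Lucene API.
class ZoneScorer {
    enum Zone { HEADLINE, FIRST_PARAGRAPH, BODY }

    private final Map<Zone, Integer> points = new EnumMap<>(Zone.class);

    ZoneScorer() {
        points.put(Zone.HEADLINE, 5);        // from the description above
        points.put(Zone.FIRST_PARAGRAPH, 4); // from the description above
        points.put(Zone.BODY, 1);            // assumed value, not given in the post
    }

    // Adds a zone's points once for every zone whose text mentions the term.
    int score(Map<Zone, String> zoneTexts, String term) {
        String needle = term.toLowerCase(Locale.ROOT);
        int total = 0;
        for (Map.Entry<Zone, String> entry : zoneTexts.entrySet()) {
            if (entry.getValue().toLowerCase(Locale.ROOT).contains(needle)) {
                total += points.get(entry.getKey());
            }
        }
        return total;
    }
}
```

A story that mentions the company in the headline and in the body, but not in the first paragraph, would score 5 + 1 = 6 under these assumed weights.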

Thanks,
Harini

markharw00d wrote:

> Sorry to contradict, Erik, but the Highlighter's QueryScorer will make 
> use of IDF, given a reader, in order to better prioritise which are 
> the "best" bits of a document.
> However, in the particular example given, the criteria include 
> several non-text fields which are not useful for IDF and general 
> scoring purposes - these are perhaps better expressed using a filter 
> of some form. Otherwise, why should the scarcity of a particular date 
> in the given range boost one matching document above others? These 
> numeric-type fields are simply mandatory boolean "hygiene factors" 
> and  should ideally play no part in highlight selection or results 
> ordering in general based on their IDF or TF.
>
> Cheers,
> Mark
>
>
> Erik Hatcher wrote:
>
>> Harini,
>>
>> I'm not sure I understand what you're asking.  IDF doesn't factor  
>> into highlighting.
>>
>> IDF calculations are useful in scoring documents during a search,  
>> such that the most relevant documents are returned, but again this 
>> is  unrelated to highlighting.
>>
>> Could you elaborate on what you're after?
>>
>>     Erik
>>
>> On Dec 30, 2005, at 12:02 PM, Harini Raghavan wrote:
>>
>>> Hi,
>>>
>>> I have a requirement to highlight search keywords in the results and
>>> display the matching fragment of the text with the results. I am using
>>> the Hits highlighting mentioned in Lucene in Action.
>>>
>>> Here is the search query(BooleanQuery) I am passing to the  
>>> IndexSearcher
>>> and QueryScorer:
>>> +DocumentType:news
>>> +(CompanyId:10 CompanyId:20 CompanyId:30 CompanyId:40)
>>> +FilingDate:[20041201 TO 20051201]
>>> +(Content:"cost saving" Content:"cost savings" Content:outsource
>>> Content:outsources Content:downsize Content:downsizes
>>> Content:restructuring Content:restructure)
>>>
>>> I do not quite understand how the query scoring actually works &  
>>> how Inverse Document Frequency(IDF) calculations are useful?  Can
>>> someone shed some light on this using the given query as an example?
>>>
>>> Thanks,
>>> Harini
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org

Re: indexreader refresh

Posted by Doug Cutting <cu...@apache.org>.

Yes, that's a good start.  Your patch does not handle deletions 
correctly.  If a segment has had deletions since it was opened then its 
deletions file needs to be re-read.  I also think returning a new 
IndexReader is preferable to modifying one, since an IndexReader is 
often used as a cache key, and caches should be invalidated when an 
IndexReader is re-opened.
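
The cache-key argument can be illustrated with a small plain-Java sketch. It assumes a cache keyed on the reader instance itself; the Reader class here is a stand-in, not Lucene's IndexReader, and ReaderKeyedCache is a hypothetical name. Because a re-open returns a new object, the new reader simply misses the cache, and no explicit invalidation is needed.

```java
import java.util.Map;
import java.util.WeakHashMap;

// Hypothetical illustration; Reader is a stand-in, not Lucene's IndexReader.
class ReaderKeyedCache {

    // Only the identity of the reader matters: equals/hashCode are not overridden.
    static final class Reader {
        final long version;
        Reader(long version) { this.version = version; }
    }

    // Entries disappear once the reader used as key is no longer referenced.
    private final Map<Reader, String> cache = new WeakHashMap<>();
    int computations = 0;

    synchronized String get(Reader reader) {
        String value = cache.get(reader);
        if (value == null) {
            computations++; // stands for an expensive per-reader computation
            value = "data@v" + reader.version;
            cache.put(reader, value);
        }
        return value;
    }
}
```

If the existing reader were modified in place instead, the cache would keep returning the entry computed for the old index state.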

Robert Engels wrote:
> I proposed and posted a patch for this long ago. The only thing missing would be
> some sort of reference counting for segments (rather than the 'stayopen'
> flag).

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

RE: indexreader refresh

Posted by Robert Engels <re...@ix.netcom.com>.

I proposed and posted a patch for this long ago. The only thing missing would be
some sort of reference counting for segments (rather than the 'stayopen'
flag).

  /**
   * Reopens the IndexReader, possibly reusing the segments for greater
   * efficiency. The original IndexReader instance is closed, and the
   * reference is no longer valid.
   *
   * @return the new IndexReader
   */
  public IndexReader reopen() throws IOException {
      if(!(this instanceof MultiReader))
          return IndexReader.open(directory);

      MultiReader mr = (MultiReader) this;

      final IndexReader[] oldreaders = mr.getReaders();
      final boolean[] stayopen = new boolean[oldreaders.length];

      synchronized (directory) {			  // in- & inter-process sync
          return (IndexReader)new Lock.With(
              directory.makeLock(IndexWriter.COMMIT_LOCK_NAME),
              IndexWriter.COMMIT_LOCK_TIMEOUT) {
              public Object doBody() throws IOException {
                SegmentInfos infos = new SegmentInfos();
                infos.read(directory);
                if (infos.size() == 1) {		  // index is optimized
                  return new SegmentReader(infos, infos.info(0), closeDirectory);
                } else {
                  IndexReader[] readers = new IndexReader[infos.size()];
                  for (int i = 0; i < infos.size(); i++) {
                      for(int j=0;j<oldreaders.length;j++) {
                          SegmentReader sr = (SegmentReader) oldreaders[j];
                          if(sr.si.name.equals(infos.info(i).name)) {
                              readers[i]=sr;
                              stayopen[j]=true;
                          }
                      }
                      if(readers[i]==null)
                          readers[i] = new SegmentReader(infos.info(i));
                  }

                  for(int i=0;i<stayopen.length;i++)
                      if(!stayopen[i])
                          oldreaders[i].close();

                  return new MultiReader(directory, infos, closeDirectory, readers);
                }
              }
            }.run();
        }
  }
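
The reference counting mentioned above, as a replacement for the 'stayopen' flag, might look roughly like this. It is a sketch in plain Java under the assumption that each reader sharing a segment takes one reference and releases it on close; RefCountedSegment and its methods are hypothetical names, not part of the patch.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration; not part of the patch above.
class RefCountedSegment {
    private final String name;
    private final AtomicInteger refCount = new AtomicInteger(1); // creator holds one reference
    private volatile boolean closed = false;

    RefCountedSegment(String name) { this.name = name; }

    String name() { return name; }
    boolean isClosed() { return closed; }

    // Called by each additional reader that starts sharing this segment.
    void incRef() {
        int count;
        do {
            count = refCount.get();
            if (count <= 0) {
                throw new IllegalStateException("segment " + name + " is already closed");
            }
        } while (!refCount.compareAndSet(count, count + 1));
    }

    // Called when a reader using this segment is closed.
    void decRef() {
        int count = refCount.decrementAndGet();
        if (count == 0) {
            closed = true; // real code would release the segment's files here
        } else if (count < 0) {
            throw new IllegalStateException("decRef called too often on " + name);
        }
    }
}
```

With this in place, closing the old reader only releases its references, and a segment shared with the new reader stays open until that reader is closed as well.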

-----Original Message-----
From: Doug Cutting [mailto:cutting@apache.org]
Sent: Wednesday, January 04, 2006 12:30 PM
To: java-dev@lucene.apache.org
Subject: Re: indexreader refresh


Amol Bhutada wrote:
> If I have a reader and searcher on a indexdata folder and another
> indexwriter writing documents to the same indexdata folder, do I need to
> close existing reader and searcher and create new so that newly indexed
> data comes into search effect?

[ moved from user to dev list]

This is a frequent request.  While opening an all-new IndexReader is
effective, it is not always efficient.  It might be nice to support a
more efficient means of re-opening an index.

Perhaps we should add a few new IndexReader methods, as follows:

/** If <code>reader</code>'s index has not been changed, return
   * <code>reader</code>, otherwise return a new {@link IndexReader}
   * reading the latest version of the index.
   */
public static IndexReader open(IndexReader reader) throws IOException {
   if (reader.isCurrent()) {
     // unchanged: return existing
     return reader;
   }

   // try to incrementally create new reader
   IndexReader result = reader.reopen(reader);
   if (result != null) {
     return result;
   }

   // punt, opening an entirely new reader
   return IndexReader.open(reader.directory());
}

/** Return a new IndexReader reading the current state
   * of the index, re-using reader's resources, or null if this
   * is not possible.
   */
protected IndexReader reopen(IndexReader reader) {
   return null;
}

Then we can add implementations of reopen to SegmentReader and
MultiReader that attempt to re-use the existing, already opened
segments.  This should mostly be simple, but there are a few tricky
issues, like detecting whether an already-open segment has had
deletions, and deciding when to close obsolete segments.

Does this sound like it would make a good addition?  Does someone want
to volunteer to implement it?

Doug


Re: indexreader refresh

Posted by Doug Cutting <cu...@apache.org>.

Amol Bhutada wrote:
> If I have a reader and searcher on a indexdata folder and another 
> indexwriter writing documents to the same indexdata folder, do I need to 
> close existing reader and searcher and create new so that newly indexed 
> data comes into search effect?

[ moved from user to dev list]

This is a frequent request.  While opening an all-new IndexReader is 
effective, it is not always efficient.  It might be nice to support a 
more efficient means of re-opening an index.

Perhaps we should add a few new IndexReader methods, as follows:

/** If <code>reader</code>'s index has not been changed, return
   * <code>reader</code>, otherwise return a new {@link IndexReader}
   * reading the latest version of the index.
   */
public static IndexReader open(IndexReader reader) throws IOException {
   if (reader.isCurrent()) {
     // unchanged: return existing
     return reader;
   }

   // try to incrementally create new reader
   IndexReader result = reader.reopen(reader);
   if (result != null) {
     return result;
   }

   // punt, opening an entirely new reader
   return IndexReader.open(reader.directory());
}

/** Return a new IndexReader reading the current state
   * of the index, re-using reader's resources, or null if this
   * is not possible.
   */
protected IndexReader reopen(IndexReader reader) {
   return null;
}

Then we can add implementations of reopen to SegmentReader and 
MultiReader that attempt to re-use the existing, already opened 
segments.  This should mostly be simple, but there are a few tricky 
issues, like detecting whether an already-open segment has had 
deletions, and deciding when to close obsolete segments.
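
The segment re-use and the deletions check can be sketched in plain Java as below. This is an illustration under assumptions, not Lucene code: it supposes each segment carries a counter (called delGen here) that changes whenever its deletions change, and all class and field names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; these are not Lucene's classes.
class SegmentReuseSketch {

    // What the index currently records about a segment.
    static final class SegmentInfo {
        final String name;
        final long delGen; // assumed counter, bumped whenever deletions change
        SegmentInfo(String name, long delGen) { this.name = name; this.delGen = delGen; }
    }

    // An opened segment, remembering the state it was loaded from.
    static final class OpenSegment {
        final SegmentInfo info;
        OpenSegment(SegmentInfo info) { this.info = info; }
    }

    // Re-uses an open segment only if it still exists with unchanged deletions.
    static List<OpenSegment> reopen(List<OpenSegment> old, List<SegmentInfo> current) {
        Map<String, OpenSegment> byName = new HashMap<>();
        for (OpenSegment segment : old) {
            byName.put(segment.info.name, segment);
        }
        List<OpenSegment> result = new ArrayList<>();
        for (SegmentInfo info : current) {
            OpenSegment existing = byName.get(info.name);
            if (existing != null && existing.info.delGen == info.delGen) {
                result.add(existing);              // unchanged: share it
            } else {
                result.add(new OpenSegment(info)); // new segment, or deletions changed
            }
        }
        return result;
    }
}
```

Old segments that do not appear in the result are the obsolete ones; when they can safely be closed is the second tricky issue and is left out of this sketch.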

Does this sound like it would make a good addition?  Does someone want 
to volunteer to implement it?

Doug


RE: indexreader refresh

Posted by Ramana Jelda <ra...@ciao-group.com>.

Hi Amol,
Yes, you need to close the reader (and open a new one) for the updated index to take effect.


Regards,
Jelda 
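
A minimal sketch of the close-and-reopen pattern, in plain Java. The Searcher class is a stand-in rather than Lucene's IndexSearcher, the index version is passed in by the caller, and SearcherHolder is a hypothetical name; how the latest version is obtained is deliberately left out.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical illustration; Searcher is a stand-in, not Lucene's IndexSearcher.
class SearcherHolder {

    // Represents a searcher opened on one snapshot of the index.
    static final class Searcher {
        final long indexVersion;
        private volatile boolean closed = false;

        Searcher(long indexVersion) { this.indexVersion = indexVersion; }
        void close() { closed = true; }
        boolean isClosed() { return closed; }
    }

    private final AtomicReference<Searcher> current;

    SearcherHolder(long initialVersion) {
        current = new AtomicReference<>(new Searcher(initialVersion));
    }

    // Search threads always fetch the searcher through this method.
    Searcher get() { return current.get(); }

    // Swaps in a new searcher if the index has changed; returns true if it did.
    synchronized boolean refresh(long latestIndexVersion) {
        Searcher old = current.get();
        if (old.indexVersion == latestIndexVersion) {
            return false; // nothing new was written
        }
        current.set(new Searcher(latestIndexVersion));
        old.close(); // real code must first wait for in-flight searches on 'old'
        return true;
    }
}
```

Calling refresh periodically, or after each batch of writes, keeps searches running on the old snapshot until the new one is ready.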

-----Original Message-----
From: Amol Bhutada [mailto:amolb@synechron.com] 
Sent: Wednesday, January 04, 2006 4:21 PM
To: java-user@lucene.apache.org
Subject: indexreader refresh

If I have a reader and searcher on a indexdata folder and another
indexwriter writing documents to the same indexdata folder, do I need to
close existing reader and searcher and create new so that newly indexed data
comes into search effect?

I have checked through google, got some pointers but some important links
are not opening now, so If you can give me a pointer or clear picture about
this it will be great.

I am looking at implementing lucene searching for a site having millions of
user records so even looking for best way to keep my indexes uptodate while
searching is going on.

thanks
Amol


--------------------------------------------------------------------
Mail Disclaimer: This e-mail and any files transmitted with it are
confidential and the views expressed in the same are not necessarily the
views of Synechron, and its Directors, Management or Employees. This
communication represents the originator's personal views and opinions. If
you are not the intended recipient or the person responsible for delivering
the e-mail to the intended recipient, be advised that you have received this
e-mail by error, and that any use, dissemination, forwarding, printing, or
copying of this e-mail is strictly prohibited. You shall be under obligation
to keep the contents of this e-mail, strictly confidential and shall not
disclose, disseminate or divulge the same to any Person, Company, Firm or
Entity. Even though Synechron uses up-to-date virus checking software to
scan it's emails please ensure you have adequate virus protection before you
open or detach any documents from this transmission. Synechron does not
accept any liability for viruses  or vulnerabilities. The rights to monitor
all e-mail communication through our network are reserved with us.


indexreader refresh

Posted by Amol Bhutada <am...@synechron.com>.

If I have a reader and searcher on an indexdata folder and another 
IndexWriter writing documents to the same indexdata folder, do I need to 
close the existing reader and searcher and create new ones so that newly 
indexed data shows up in searches?

I have searched on Google and found some pointers, but some important 
links are not opening now, so it would be great if you could give me a 
pointer or a clear picture of this.

I am looking at implementing Lucene search for a site with millions 
of user records, so I am also looking for the best way to keep my indexes 
up to date while searching is going on.

thanks
Amol



Re: Query Scoring

Posted by Harini Raghavan <ha...@insideview.com>.

Thank you Chris. That seems like a good suggestion. I will try to pass a 
different Query object to the Highlighter API than the one used for 
searching.

I plan to break down the HTML document and store the 
title/subtitle/content in different fields of the index. So if I create a 
new query comparing company name and keywords against the title and 
content fields, then I am assuming that the Highlighter API will rank a 
fragment where both terms of the query match higher than fragments where 
just one term (either title or content) matches. I am assuming that even 
if I do not increase the boost factor of any of the terms, the API will 
take care of this ranking.
This is my understanding of the scoring/ranking algorithm. Any comments, 
anyone?

Thanks,
Harini


Re: Query Scoring

Posted by Chris Hostetter <ho...@fucit.org>.

: My requirement is to show the relevant fragments of the news article for
: each company along with the search results. But the Highlighter API
: sometimes picks fragments that are not very relevant to the news
: article/company. I would like to know if there is any way I can
: modify the scoring/ranking of these fragments so that a news item with
: the company name and keywords in the headline is assigned a very strong
: relevancy ranking, closely followed by a company name mention in the
: first paragraph, and then multiple mentions within the entire story.
: Something like headline = 5 points, first paragraph = 4, etc.

Well, the sample query you mentioned isn't checking any company names, or
doing anything with a "keywords" field.  I'm not too familiar with the way
the highlighter package works, but i imagine that with the types of
queries you said you are using, if you are highlighting the "Content"
field, the CompanyId and the FilingDate clauses of your query will be
fairly irrelevant (because they are numbers, not because they are different
field names).

An idea i've suggested before (but i don't remember if anyone ever said
whether it is a viable use of the Highlighter or not) is to give the
highlighter a completely different Query object than the one you used to
get your search results.

ie, if your search query (what you want used to compute score) is...

  +(CompanyId:10 CompanyId:20) Content:"cost saving" Content:outsource

...but once you've gotten those results, what you really care about is
highlighting the name of the company, and you think the best fragments
are those where the company names appear near the other words, then give the
highlighter a query that looks like...

  "companyname10 cost savings"~20 "companyname20 outsource"~20 ...etc

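The proximity idea behind such a query can be illustrated without the Highlighter. The sketch below, in plain Java, checks whether two tokens occur within a given number of positions of each other in a fragment. It is only a rough approximation of what a slop of 20 expresses, not how Lucene evaluates sloppy phrases, and ProximityCheck and near are hypothetical names.

```java
// Hypothetical illustration; not how Lucene evaluates sloppy phrase queries.
class ProximityCheck {

    // True if tokens a and b occur within maxDistance positions of each other.
    static boolean near(String[] tokens, String a, String b, int maxDistance) {
        int lastA = -1;
        int lastB = -1;
        for (int i = 0; i < tokens.length; i++) {
            if (tokens[i].equals(a)) lastA = i;
            if (tokens[i].equals(b)) lastB = i;
            // Compare the most recent position of each token seen so far.
            if (lastA >= 0 && lastB >= 0 && Math.abs(lastA - lastB) <= maxDistance) {
                return true;
            }
        }
        return false;
    }
}
```

A fragment in which the company name and "outsource" both appear, but far apart, would fail this check, which is the kind of fragment the separate highlighting query is meant to rank lower.
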


: >>> Here is the search query(BooleanQuery) I am passing to the
: >>> IndexSearcher
: >>> and QueryScorer:
: >>> +DocumentType:news
: >>> +(CompanyId:10 CompanyId:20 CompanyId:30 CompanyId:40)
: >>> +FilingDate:[20041201 TO 20051201]
: >>> +(Content:"cost saving" Content:"cost savings" Content:outsource
: >>> Content:outsources Content:downsize Content:downsizes
: >>> Content:restructuring Content:restructure)



-Hoss

