Posted to java-user@lucene.apache.org by Harini Raghavan <ha...@insideview.com> on 2005/12/01 06:08:41 UTC

Re: how to control terms to be highlighted?

Hi Mark,

It would be great if you could make this change and send the
QueryTermsExtractor class. I am invoking the QueryScorer(Query)
constructor. Should I use QueryScorer(Query query, IndexReader reader,
String fieldName) instead for this to work?

Thanks,
Harini

mark harwood wrote:

>>Is there any way to restrict the highlighter to
>>highlight only the values
>>mentioned for the field 'Content'?
>
>The problem lies in the QueryTermsExtractor class
>which is typically used to provide the Highlighter
>with the list of strings to identify in the text. It
>currently has no filter for fieldname - you could add
>this without too much effort.
>
>I could make this modification but it may change the
>behaviour of existing applications. Currently the
>QueryTermsExtractor method that takes a fieldname only
>uses that fieldname to derive IDF weightings; the
>proposed change would also have the effect of
>filtering out any query terms that weren't for this
>field.
>Would this change be a problem for anyone?
>
>Cheers,
>Mark
>
>--- Harini Raghavan <ha...@insideview.com> wrote:
>
>>Hi,
>>
>>I have a requirement to highlight search keywords in the results and
>>display the matching fragment of the text with the results. I am using
>>the Hits highlighting mentioned in Lucene in Action.
>>
>>Here is the search query (a BooleanQuery) I am passing to the
>>IndexSearcher and QueryScorer:
>> +DocumentType:news
>> +(CompanyId:10 CompanyId:20 CompanyId:30 CompanyId:40)
>> +FilingDate:[20041201 TO 20051201]
>> +(Content:"cost saving" Content:"cost savings" Content:outsource
>>   Content:outsources Content:downsize Content:downsizes
>>   Content:restructuring Content:restructure)
>>
>>My requirement is to highlight only the keywords for the 'Content' field,
>>but the Highlighter API is also highlighting words like 'news', '10',
>>'40' etc.
>>Is there any way to restrict the highlighter to highlight only the values
>>mentioned for the field 'Content'?
>>
>>Thanks,
>>Harini


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
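
The field filter discussed above can be sketched in plain Java, without
touching the Lucene classes. FieldTerm and termsForField are hypothetical
names used only for illustration, and treating an empty field name as
"no filter" is an assumption of this sketch:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class FieldFilterSketch {

    /** Hypothetical stand-in for a query term: a field name plus the term text. */
    static final class FieldTerm {
        final String field;
        final String text;

        FieldTerm(String field, String text) {
            this.field = field;
            this.text = text;
        }
    }

    /**
     * Returns the text of every term that belongs to fieldName.
     * Assumption: an empty fieldName means "keep terms from all fields".
     */
    static List<String> termsForField(List<FieldTerm> queryTerms, String fieldName) {
        List<String> kept = new ArrayList<String>();
        for (FieldTerm term : queryTerms) {
            if (fieldName.isEmpty() || term.field.equals(fieldName)) {
                kept.add(term.text);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<FieldTerm> terms = Arrays.asList(
                new FieldTerm("DocumentType", "news"),
                new FieldTerm("CompanyId", "10"),
                new FieldTerm("Content", "outsource"),
                new FieldTerm("Content", "downsize"));

        // Only the 'Content' terms would be handed to the highlighter.
        System.out.println(termsForField(terms, "Content"));
        // With no field name, every term is kept (the behaviour complained about).
        System.out.println(termsForField(terms, ""));
    }
}
```

With a filter like this in place, 'news' and '10' never reach the list of
strings to highlight, which is the effect being asked for.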

Re: how to control terms to be highlighted?

Posted by mark harwood <ma...@yahoo.co.uk>.

It looks like you may need to parse the source document more
intelligently.

In the specific example you gave, the text comes from a drop-down
combo in a form unrelated to the article text that I imagine you are
interested in. There is also a lot of other "guff" around the edges,
to do with related news stories, which I see as hard to remove
without special knowledge of that page's structure.

All assuming, of course, that this doesn't violate the Yahoo TOS.




		

Re: how to control terms to be highlighted?

Posted by Harini Raghavan <ha...@insideview.com>.

Hi,

I was able to use the Highlighter API to extract the text where the
keywords occur. However, I am facing another, related problem. My
application downloads the news items to the local server. The indexer
API parses these HTML files, extracts the content and stores it in
the index. The parser extracts all the text in the HTML page, including
the title, headings etc. So when the highlighter is run on this content,
instead of highlighting the keywords in the main content, it just shows
the title or words found at the beginning of the page.

For example, for the article at
http://biz.yahoo.com/rb/051130/apple.html the highlighted text is
something like this:

    Options Order Book Symbol Lookup Reuters Apple may launch Intel
    laptops: analyst Wednesday November 30, 10:24 am ET NEW

My requirement is to extract the best fragment/sentence from the news
article where the keywords appear (similar to Google) and display it below
the search result. But the text extracted above is not really the best
fragment; it seems to be just the first fragment that contains the
keywords. Has anyone implemented this kind of functionality?

-Harini
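
One way to think about "best fragment" selection, sketched in plain Java
and deliberately not using the Highlighter classes: split the text into
sentences, score each sentence by the number of distinct query terms it
contains, and keep the highest-scoring one. All names here are
hypothetical, the sample text is made up, and the naive sentence split is
a simplifying assumption rather than a description of how the Highlighter
fragments text:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class BestFragmentSketch {

    /** Counts how many distinct (lower-case) query terms occur as whole words in the sentence. */
    static int score(String sentence, Set<String> queryTerms) {
        Set<String> words = new HashSet<String>(
                Arrays.asList(sentence.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")));
        int hits = 0;
        for (String term : queryTerms) {
            if (words.contains(term)) {
                hits++;
            }
        }
        return hits;
    }

    /** Returns the highest-scoring sentence, or "" if no sentence contains a query term. */
    static String bestFragment(String text, Set<String> queryTerms) {
        String best = "";
        int bestScore = 0;
        // Naive split: a sentence ends at '.', '!' or '?' followed by whitespace.
        for (String sentence : text.split("(?<=[.!?])\\s+")) {
            int s = score(sentence, queryTerms);
            if (s > bestScore) {   // strict '>' keeps the earliest sentence on ties
                bestScore = s;
                best = sentence.trim();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Set<String> terms = new HashSet<String>(Arrays.asList("apple", "intel", "laptops"));
        // Made-up sample: page "chrome" first, article text second.
        String text = "Symbol Lookup Reuters Apple news. "
                + "Apple may launch Intel laptops next year. "
                + "Related stories follow.";
        System.out.println(bestFragment(text, terms));
    }
}
```

The first sentence mentions only one query term, so it loses to the
second even though it comes earlier in the page. Scoring by distinct
terms is only one possible choice; weighting rarer terms more heavily
would be another.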



Harini Raghavan wrote:

> Hi Chris,
>
> Can we pass one query object for searching and a different one
> to the highlighter? I am not sure about that.
> In any case, based on Mark's suggestion, I modified the
> QueryTermsExtractor class to filter the query terms by fieldName.
> Attached is the modified file.
>
> Thanks,
> Harini
>------------------------------------------------------------------------
>
>package org.apache.lucene.search.highlight;
>/**
> * Copyright 2002-2004 The Apache Software Foundation
> *
> * Licensed under the Apache License, Version 2.0 (the "License");
> * you may not use this file except in compliance with the License.
> * You may obtain a copy of the License at
> *
> *     http://www.apache.org/licenses/LICENSE-2.0
> *
> * Unless required by applicable law or agreed to in writing, software
> * distributed under the License is distributed on an "AS IS" BASIS,
> * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> * See the License for the specific language governing permissions and
> * limitations under the License.
> */
>
>import java.io.IOException;
>import java.util.Collection;
>import java.util.HashSet;
>import java.util.Iterator;
>
>import org.apache.lucene.index.IndexReader;
>import org.apache.lucene.index.Term;
>import org.apache.lucene.search.BooleanClause;
>import org.apache.lucene.search.BooleanQuery;
>import org.apache.lucene.search.PhraseQuery;
>import org.apache.lucene.search.Query;
>import org.apache.lucene.search.TermQuery;
>import org.apache.lucene.search.spans.SpanNearQuery;
>
>/**
> * Utility class used to extract the terms used in a query, plus any weights.
> * This class will not find terms for MultiTermQuery, RangeQuery and PrefixQuery classes
> * so the caller must pass a rewritten query (see Query.rewrite) to obtain a list of
> * expanded terms.
> *
> */
>public final class QueryTermExtractor
>{
>
>	/**
>	 * Extracts all terms texts of a given Query into an array of WeightedTerms
>	 *
>	 * @param query      Query to extract term texts from
>	 * @return an array of the terms used in a query, plus their weights.
>	 */
>	public static final WeightedTerm[] getTerms(Query query)
>	{
>		return getTerms(query,false,"");
>	}
>
>	/**
>	 * Extracts all terms texts of a given Query into an array of WeightedTerms
>	 *
>	 * @param query      Query to extract term texts from
>	 * @param reader     used to compute IDF, which can be used to a) score selected fragments better
>	 *                   b) use graded highlights, e.g. changing intensity of font color
>	 * @param fieldName  the field on which Inverse Document Frequency (IDF) calculations are based;
>	 *                   when non-empty, query terms for other fields are filtered out
>	 * @return an array of the terms used in a query, plus their weights.
>	 */
>	public static final WeightedTerm[] getIdfWeightedTerms(Query query, IndexReader reader, String fieldName)
>	{
>		WeightedTerm[] terms=getTerms(query,false,fieldName);
>		int totalNumDocs=reader.numDocs();
>		for (int i = 0; i < terms.length; i++)
>		{
>			try
>			{
>				int docFreq=reader.docFreq(new Term(fieldName,terms[i].term));
>				//IDF algorithm taken from DefaultSimilarity class
>				float idf=(float)(Math.log((float)totalNumDocs/(double)(docFreq+1)) + 1.0);
>				terms[i].weight*=idf;
>			}
>			catch (IOException e)
>			{
>				//ignore: the term keeps its un-weighted boost
>			}
>		}
>		return terms;
>	}
>
>	/**
>	 * Extracts all terms texts of a given Query into an array of WeightedTerms
>	 *
>	 * @param query      Query to extract term texts from
>	 * @param prohibited <code>true</code> to extract "prohibited" terms, too
>	 * @param fieldName  if non-empty, only terms for this field are extracted;
>	 *                   pass "" to extract terms for all fields
>	 * @return an array of the terms used in a query, plus their weights.
>	 */
>	public static final WeightedTerm[] getTerms(Query query, boolean prohibited, String fieldName)
>	{
>		HashSet terms=new HashSet();
>		getTerms(query,terms,prohibited,fieldName);
>		return (WeightedTerm[]) terms.toArray(new WeightedTerm[0]);
>	}
>
>	private static final void getTerms(Query query, HashSet terms,boolean prohibited, String fieldName)
>	{
>		if (query instanceof BooleanQuery)
>			getTermsFromBooleanQuery((BooleanQuery) query, terms, prohibited, fieldName);
>		else if (query instanceof PhraseQuery)
>			getTermsFromPhraseQuery((PhraseQuery) query, terms, fieldName);
>		else if (query instanceof TermQuery)
>			getTermsFromTermQuery((TermQuery) query, terms, fieldName);
>		else if (query instanceof SpanNearQuery)
>			getTermsFromSpanNearQuery((SpanNearQuery) query, terms, fieldName);
>	}
>
>	private static final void getTermsFromBooleanQuery(BooleanQuery query, HashSet terms, boolean prohibited, String fieldName)
>	{
>		BooleanClause[] queryClauses = query.getClauses();
>		int i;
>
>		for (i = 0; i < queryClauses.length; i++)
>		{
>			if (prohibited || !queryClauses[i].prohibited)
>				getTerms(queryClauses[i].query, terms, prohibited, fieldName);
>		}
>	}
>
>	private static final void getTermsFromPhraseQuery(PhraseQuery query, HashSet terms, String fieldName)
>	{
>		Term[] queryTerms = query.getTerms();
>		int i;
>		String field;
>
>		for (i = 0; i < queryTerms.length; i++)
>		{
>			if(fieldName.equals(""))
>				terms.add(new WeightedTerm(query.getBoost(),queryTerms[i].text()));
>			else {
>				field = queryTerms[i].field();
>				if(field.equals(fieldName))
>					terms.add(new WeightedTerm(query.getBoost(),queryTerms[i].text()));
>			}
>		}
>	}
>
>	private static final void getTermsFromTermQuery(TermQuery query, HashSet terms, String fieldName)
>	{
>		String field = query.getTerm().field();
>		if(fieldName.equals(""))
>			terms.add(new WeightedTerm(query.getBoost(),query.getTerm().text()));
>		else if(field.equals(fieldName)) {
>			terms.add(new WeightedTerm(query.getBoost(),query.getTerm().text()));
>		}
>	}
>
>	private static final void getTermsFromSpanNearQuery(SpanNearQuery query, HashSet terms, String fieldName)
>	{
>		Collection queryTerms = query.getTerms();
>
>		for (Iterator iterator = queryTerms.iterator(); iterator.hasNext();)
>		{
>			// break it out for debugging.
>			Term term = (Term) iterator.next();
>			String text = term.text();
>			String field = term.field();
>
>			// an empty fieldName means "no filter"; otherwise keep only terms for that field
>			if (fieldName.equals("") || field.equals(fieldName))
>				terms.add(new WeightedTerm(query.getBoost(), text));
>		}
>	}
>
>}
>
>
>  
>
>------------------------------------------------------------------------
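
The IDF weighting used in getIdfWeightedTerms in the attached file can be
checked in isolation. The sketch below reproduces only that formula,
ln(numDocs / (docFreq + 1)) + 1, in plain Java; the class name, method
name and sample counts are made up for illustration:

```java
final class IdfSketch {

    /** Same formula as in the attached file: ln(numDocs / (docFreq + 1)) + 1. */
    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    public static void main(String[] args) {
        int numDocs = 1000;                 // made-up index size
        float rare = idf(9, numDocs);       // term found in 9 documents
        float common = idf(499, numDocs);   // term found in 499 documents

        // A rarer term gets the larger multiplier, so it contributes more
        // to the weight of a term than a common one with the same boost.
        System.out.println("rare=" + rare + " common=" + common);
    }
}
```

Because each term's boost is multiplied by this value, a fragment that
contains a rare query term can outscore one that only contains common
terms.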

Re: how to control terms to be highlighted?

Posted by mark harwood <ma...@yahoo.co.uk>.

Hi Harini, 
I updated QueryTermsExtractor in Subversion last night
to support your requirement.

The JUnit test is also updated with a field-specific
example.

Cheers,
Mark



Re: how to control terms to be highlighted?

Posted by Harini Raghavan <ha...@insideview.com>.

Hi Chris,

Can we pass one query object for searching and a different one
to the highlighter? I am not sure about that.
In any case, based on Mark's suggestion, I modified the
QueryTermsExtractor class to filter the query terms by fieldName.
Attached is the modified file.

Thanks,
Harini



Chris Hostetter wrote:

>I don't know what your application is, and I have no experience with the
>Highlighter code, so forgive me if this is a silly suggestion:
>
>It looks like you are building a query up programmatically, which
>contains some words to search on, and some other stuff that's mainly
>being used to "filter" the results (I'll avoid my usual rant about
>people underutilizing Filters).  So why not pass the Highlighter just
>the portion of the Query that you actually want to contribute to the
>highlighting?  In this query...
>
>: >> +DocumentType:news
>: >> +(CompanyId:10 CompanyId:20 CompanyId:30 CompanyId:40)
>: >> +FilingDate:[20041201 TO 20051201]
>: >> +(Content:"cost saving" Content:"cost savings"
>: >>Content:outsource
>: >>Content:outsources Content:downsize
>: >>Content:downsizes
>: >>Content:restructuring Content:restructure)
>
>...just give the highlighter...
>
>    (Content:"cost saving" Content:"cost savings"
>     Content:outsource
>     Content:outsources Content:downsize
>     Content:downsizes
>     Content:restructuring Content:restructure)
>
>
>: Date: Thu, 01 Dec 2005 10:38:41 +0530
>: From: Harini Raghavan <ha...@insideview.com>
>: Reply-To: java-user@lucene.apache.org
>: To: java-user@lucene.apache.org
>: Subject: Re: how to control terms to be highlighted?
>:
>: Hi Mark,
>:
>: It would be great if you can make this change and send the
>: QueryTermsExtractor class. I am invoking the QueryScorer(Query)
>: contructor. Should I use QueryScorer(Query query, IndexReader reader,
>: String fieldName) instead for this to work?
>:
>: Thanks,
>: Harini
>:
>: mark harwood wrote:
>:
>: >>>>Is there anyway to restrict the highlighter to
>: >>>>
>: >>>>
>: >>highlight only the values
>: >>mentioned for the field 'Content'?
>: >>
>: >>
>: >
>: >The problem lies in the QueryTermsExtractor class
>: >which is typically used to provide the Highlighter
>: >with the list of strings to identify in the text. It
>: >currently has no filter for fieldname - you could add
>: >this without too much effort.
>: >
>: >I could make this modification but it may change the
>: >behaviour of existing applications - currently the
>: >QueryTermsExtractor method that takes a fieldname only
>: >uses that fieldname to derive IDF weightings, the
>: >proposed change would also have the effect of
>: >filtering out any query terms that weren't for this
>: >field.
>: >Would this change be a problem for anyone?
>: >
>: >Cheers,
>: >Mark
>: >
>: >--- Harini Raghavan <ha...@insideview.com>
>: >wrote:
>: >
>: >
>: >
>: >>Hi,
>: >>
>: >>I have a requirement to highlight search keywords in
>: >>the results and
>: >>display the matching fragment of the text with the
>: >>results. I am using
>: >>the Hits highlighting mentioned in Lucene in Action.
>: >>
>: >>Here is the search query(BooleanQuery) I am passing
>: >>to the IndexSearcher
>: >>and QueryScorer:
>: >> +DocumentType:news
>: >> +(CompanyId:10 CompanyId:20 CompanyId:30
>: >>CompanyId:40)
>: >> +FilingDate:[20041201 TO 20051201]
>: >> +(Content:"cost saving" Content:"cost savings"
>: >>Content:outsource
>: >>Content:outsources Content:downsize
>: >>Content:downsizes
>: >>Content:restructuring Content:restructure)
>: >>
>: >>My requirement is to highlight only the keywords for
>: >>'Content' field,
>: >>but the highlighter api is also highlighting words
>: >>like 'news', '10',
>: >>'40' etc.
>: >>Is there anyway to restrict the highlighter to
>: >>highlight only the values
>: >>mentioned for the field 'Content'?
>: >>
>: >>Thanks,
>: >>Harini
>
>
>
>-Hoss

Re: how to control terms to be highlighted?

Posted by Chris Hostetter <ho...@fucit.org>.

I don't know what your application is, and I have no experience with the
Highlighter code, so forgive me if this is a silly suggestion:

It looks like you are building a query up programmatically, which
contains some words to search on, and some other stuff that's mainly
being used to "filter" the results (I'll avoid my usual rant about
people underutilizing Filters).  So why not pass the Highlighter just
the portion of the Query that you actually want to contribute to the
highlighting?  In this query...

: >> +DocumentType:news
: >> +(CompanyId:10 CompanyId:20 CompanyId:30 CompanyId:40)
: >> +FilingDate:[20041201 TO 20051201]
: >> +(Content:"cost saving" Content:"cost savings"
: >>Content:outsource
: >>Content:outsources Content:downsize
: >>Content:downsizes
: >>Content:restructuring Content:restructure)

...just give the highlighter...

    (Content:"cost saving" Content:"cost savings"
     Content:outsource
     Content:outsources Content:downsize
     Content:downsizes
     Content:restructuring Content:restructure)
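
To make the idea concrete, here is a minimal sketch in plain Java. It does
not use the Lucene API at all: Clause, filterClauses(), contentClauses(),
searchQuery() and highlightQuery() are hypothetical stand-ins invented for
illustration, and only a few of the clauses from the query above are
included. The point is only the shape of the code: build the Content part
once, use it inside the full query for searching, and hand it on its own to
whatever does the highlighting.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitQuerySketch {

    // Hypothetical stand-in for one query clause. This is NOT a Lucene class.
    public static final class Clause {
        final String field;
        final String text;

        Clause(String field, String text) {
            this.field = field;
            this.text = text;
        }

        @Override
        public String toString() {
            return field + ":" + text;
        }
    }

    // Clauses that only narrow the result set; none of these should be highlighted.
    static List<Clause> filterClauses() {
        List<Clause> clauses = new ArrayList<>();
        clauses.add(new Clause("DocumentType", "news"));
        clauses.add(new Clause("CompanyId", "10"));
        clauses.add(new Clause("CompanyId", "20"));
        return clauses;
    }

    // Clauses on the Content field; these are the words the user wants highlighted.
    static List<Clause> contentClauses() {
        List<Clause> clauses = new ArrayList<>();
        clauses.add(new Clause("Content", "\"cost saving\""));
        clauses.add(new Clause("Content", "outsource"));
        clauses.add(new Clause("Content", "downsize"));
        return clauses;
    }

    // The query used for searching: filter clauses plus content clauses.
    static List<Clause> searchQuery() {
        List<Clause> all = new ArrayList<>(filterClauses());
        all.addAll(contentClauses());
        return all;
    }

    // The query given to the highlighter: the content clauses only.
    static List<Clause> highlightQuery() {
        return contentClauses();
    }

    public static void main(String[] args) {
        System.out.println("search:    " + searchQuery());
        System.out.println("highlight: " + highlightQuery());
    }
}
```

In real code the Content part would presumably be its own BooleanQuery,
added to the full query for searching and passed by itself to the
QueryScorer(Query) constructor mentioned below.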


: Date: Thu, 01 Dec 2005 10:38:41 +0530
: From: Harini Raghavan <ha...@insideview.com>
: Reply-To: java-user@lucene.apache.org
: To: java-user@lucene.apache.org
: Subject: Re: how to control terms to be highlighted?
:
: Hi Mark,
:
: It would be great if you can make this change and send the
: QueryTermsExtractor class. I am invoking the QueryScorer(Query)
: constructor. Should I use QueryScorer(Query query, IndexReader reader,
: String fieldName) instead for this to work?
:
: Thanks,
: Harini
:
: mark harwood wrote:
:
: >>Is there any way to restrict the highlighter to
: >>highlight only the values
: >>mentioned for the field 'Content'?
: >
: >The problem lies in the QueryTermsExtractor class
: >which is typically used to provide the Highlighter
: >with the list of strings to identify in the text. It
: >currently has no filter for fieldname - you could add
: >this without too much effort.
: >
: >I could make this modification but it may change the
: >behaviour of existing applications - currently the
: >QueryTermsExtractor method that takes a fieldname only
: >uses that fieldname to derive IDF weightings, the
: >proposed change would also have the effect of
: >filtering out any query terms that weren't for this
: >field.
: >Would this change be a problem for anyone?
: >
: >Cheers,
: >Mark
: >
: >--- Harini Raghavan <ha...@insideview.com>
: >wrote:
: >
: >
: >
: >>Hi,
: >>
: >>I have a requirement to highlight search keywords in
: >>the results and
: >>display the matching fragment of the text with the
: >>results. I am using
: >>the Hits highlighting mentioned in Lucene in Action.
: >>
: >>Here is the search query(BooleanQuery) I am passing
: >>to the IndexSearcher
: >>and QueryScorer:
: >> +DocumentType:news
: >> +(CompanyId:10 CompanyId:20 CompanyId:30
: >>CompanyId:40)
: >> +FilingDate:[20041201 TO 20051201]
: >> +(Content:"cost saving" Content:"cost savings"
: >>Content:outsource
: >>Content:outsources Content:downsize
: >>Content:downsizes
: >>Content:restructuring Content:restructure)
: >>
: >>My requirement is to highlight only the keywords for
: >>'Content' field,
: >>but the highlighter api is also highlighting words
: >>like 'news', '10',
: >>'40' etc.
: >>Is there anyway to restrict the highlighter to
: >>highlight only the values
: >>mentioned for the field 'Content'?
: >>
: >>Thanks,
: >>Harini



-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org