Posted to java-user@lucene.apache.org by starz10de <fa...@yahoo.com> on 2010/11/04 08:53:55 UTC

High frequency term for the searched query

I need to find the terms that most frequently co-occur with a query.

HighFreqTerms.java can only report the high-frequency terms for the
whole index.

I only need the high-frequency terms for the documents that match the
submitted query.

What I do now is:

I search the index with the query, retrieve the relevant documents, save
them in a new folder, and index that folder. Finally I run
HighFreqTerms.java on the new index to find the terms that occur most
frequently with the query. However, this is very slow and takes a long
time to run.

Any idea how I can do this task efficiently?


Thanks in advance

-- 
View this message in context: http://lucene.472066.n3.nabble.com/High-frequency-term-for-the-searched-query-tp1839942p1839942.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: High frequency term for the searched query

Posted by starz10de <fa...@yahoo.com>.
Hi Mic,

I tried like this:

 String indexName = "path";
 IndexReader r = IndexReader.open(indexName);
 MoreLikeThis mlt = new MoreLikeThis(r);
. .
. .
. .
. .
 BooleanQuery result = (BooleanQuery) mlt.like(docNum);
        result.add(query, BooleanClause.Occur.MUST_NOT);
       
How can I print the contents of result to see whether it contains the terms
related to the query?

I also tried:
    String[] wordlist = mlt.retrieveInterestingTerms(r);
but got the error "the method retrieveInterestingTerms(int) in the type
MoreLikeThis is not applicable for the arguments (IndexReader)".

thanks a lot


Re: High frequency term for the searched query

Posted by Michael McCandless <lu...@mikemccandless.com>.
Looks like maybe if you use MoreLikeThis directly, you can call its
retrieveInterestingTerms(Reader) method?

Or, MoreLikeThisQuery.rewrite will return a BooleanQuery whose clauses
are the interesting terms?
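
As far as I can tell from the error text quoted elsewhere in this thread and
from the suggestion above, retrieveInterestingTerms accepts either an int
document number or a java.io.Reader; an IndexReader is not a java.io.Reader,
which would explain that compile error. To show what "interesting terms"
roughly means, here is a minimal standalone sketch of a tf-idf style ranking.
It is plain Java, not Lucene's code: the class name, the method name, and the
in-memory corpus are all made up for illustration, and the idf formula is only
modelled on the classic default.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class InterestingTerms {

    /** Ranks the terms of one document by tf * idf against a small corpus. */
    static List<String> interestingTerms(List<String> doc, List<List<String>> corpus, int maxTerms) {
        // Term frequency inside the document.
        Map<String, Integer> tf = new HashMap<>();
        for (String t : doc) {
            tf.merge(t, 1, Integer::sum);
        }
        final Map<String, Double> score = new HashMap<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            // Document frequency: how many corpus documents contain the term.
            int df = 0;
            for (List<String> other : corpus) {
                if (other.contains(e.getKey())) {
                    df++;
                }
            }
            // Assumed idf shape: log(numDocs / (docFreq + 1)) + 1.
            double idf = Math.log((double) corpus.size() / (df + 1)) + 1.0;
            score.put(e.getKey(), e.getValue() * idf);
        }
        List<String> terms = new ArrayList<>(score.keySet());
        // Highest score first; ties broken alphabetically.
        terms.sort((a, b) -> {
            int c = Double.compare(score.get(b), score.get(a));
            return c != 0 ? c : a.compareTo(b);
        });
        return terms.subList(0, Math.min(maxTerms, terms.size()));
    }

    public static void main(String[] args) {
        List<List<String>> corpus = Arrays.asList(
            Arrays.asList("bank", "tax", "tax", "the"),
            Arrays.asList("bank", "the", "river"),
            Arrays.asList("the", "water"),
            Arrays.asList("the", "fish"));
        System.out.println(interestingTerms(corpus.get(0), corpus, 2));
    }
}
```

A term that is frequent in the document but rare in the corpus ("tax") ranks
above one that appears everywhere ("the").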

Mike

On Fri, Nov 5, 2010 at 11:00 AM, starz10de <fa...@yahoo.com> wrote:
>
> Hi Mike,
>
> I implemented MoreLikeThis but I couldn't figure out where or how to print
> the terms related to the given query. All I got is the documents relevant
> to the query, with their scores.
>
> Any idea how to get the related terms?


Re: High frequency term for the searched query

Posted by starz10de <fa...@yahoo.com>.
Hi Mike,

I implemented MoreLikeThis but I couldn't figure out where or how to print
the terms related to the given query. All I got is the documents relevant
to the query, with their scores.

Any idea how to get the related terms?



Re: High frequency term for the searched query

Posted by Michael McCandless <lu...@mikemccandless.com>.
Maybe MoreLikeThisQuery (under contrib/queries) will do what you want?

Mike

On Fri, Nov 5, 2010 at 3:33 AM, starz10de <fa...@yahoo.com> wrote:
>
> Hi,
>
> I need to expand the query with the terms that occur most often with it in
> documents. For example, the words credits, tax, and withdraw appear often
> with Bank. So if my query is "Bank", the result should be a ranked list of
> the terms that most frequently occur with "Bank".
>
> I could do that as I explained, but not efficiently.


RE: High frequency term for the searched query

Posted by starz10de <fa...@yahoo.com>.
Hi,

I need to expand the query with the terms that occur most often with it in
documents. For example, the words credits, tax, and withdraw appear often
with Bank. So if my query is "Bank", the result should be a ranked list of
the terms that most frequently occur with "Bank".

I could do that as I explained, but not efficiently.
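
The ranking described in the "Bank" example can be sketched in plain Java
without re-indexing anything: walk the documents that match the query once and
count every other term they contain. This is only an illustration under
assumptions. The class and method names are invented, the documents are
in-memory token lists, and a "hit" is simply a document that contains the query
term. Against a real index, the per-document terms would have to come from
somewhere such as stored term vectors (in Lucene 3.x,
IndexReader.getTermFreqVector, assuming the field was indexed with term
vectors).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CoOccurringTerms {

    /** Returns up to n terms that occur most often in documents containing queryTerm. */
    static List<Map.Entry<String, Integer>> topTerms(List<List<String>> docs, String queryTerm, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> doc : docs) {
            if (!doc.contains(queryTerm)) {
                continue; // not a hit for the query
            }
            for (String term : doc) {
                if (!term.equals(queryTerm)) {
                    counts.merge(term, 1, Integer::sum); // count each occurrence in the hit
                }
            }
        }
        List<Map.Entry<String, Integer>> ranked = new ArrayList<>(counts.entrySet());
        // Highest count first; ties broken alphabetically so the order is deterministic.
        ranked.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return ranked.subList(0, Math.min(n, ranked.size()));
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("bank", "tax", "credits", "tax"),
            Arrays.asList("bank", "withdraw", "tax"),
            Arrays.asList("river", "water", "fishing"));
        for (Map.Entry<String, Integer> e : topTerms(docs, "bank", 3)) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}
```

The cost is one pass over the hits instead of writing them to disk and building
a second index.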



RE: High frequency term for the searched query

Posted by "Burton-West, Tom" <tb...@umich.edu>.
Can you give more details about what you want?  Perhaps with an example?
Do you want the number of documents containing the query term, the number of occurrences of the query term within a document, or the number of occurrences of the term in the entire index?

You can use explain to get information on the number of occurrences within each document and the number of documents within the index: searcher.explain(query, doc)

If you want the number of occurrences of the term in the entire index, you can use org/apache/lucene/misc/GetTermInfo.java. You give it a term and it looks up the total number of documents containing the term and the total number of occurrences of the term in the index.


http://svn.apache.org/viewvc/lucene/dev/branches/branch_3x/lucene/contrib/misc/src/java/org/apache/lucene/misc/GetTermInfo.java?revision=957522&view=markup
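
The three counts distinguished above (documents containing the term,
occurrences within one document, occurrences in the whole index) are easy to
confuse, so here is a small plain-Java sketch that computes each of them over
an in-memory list of tokenized documents. The class and method names are
hypothetical and nothing here calls Lucene; it only illustrates the
definitions.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class TermStats {

    /** Number of documents that contain the term at least once. */
    static int docFreq(List<List<String>> docs, String term) {
        int n = 0;
        for (List<String> doc : docs) {
            if (doc.contains(term)) {
                n++;
            }
        }
        return n;
    }

    /** Occurrences of the term within a single document. */
    static int termFreq(List<String> doc, String term) {
        return Collections.frequency(doc, term);
    }

    /** Occurrences of the term summed over every document. */
    static int totalTermFreq(List<List<String>> docs, String term) {
        int total = 0;
        for (List<String> doc : docs) {
            total += termFreq(doc, term);
        }
        return total;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("bank", "tax", "credits", "tax"),
            Arrays.asList("bank", "withdraw", "tax"),
            Arrays.asList("river", "water", "fishing"));
        System.out.println("docFreq(tax)       = " + docFreq(docs, "tax"));
        System.out.println("termFreq(doc0,tax) = " + termFreq(docs.get(0), "tax"));
        System.out.println("totalTermFreq(tax) = " + totalTermFreq(docs, "tax"));
    }
}
```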

Tom

Tom Burton-West
http://www.hathitrust.org/blogs/large-scale-search



RE: High frequency term for the searched query

Posted by starz10de <fa...@yahoo.com>.
Hi,

I did as explained on the website:

    final Set<Term> terms = new HashSet<Term>();

    query = searcher.rewrite(query);
    query.extractTerms(terms);

    for (Term t : terms) {
        int frequency = searcher.docFreq(t);
    }

However, I can't understand why this error appears:

"the method extractterms(Set<Term>) is undefined for the type Query"

Any idea?



RE: High frequency term for the searched query

Posted by Uwe Schindler <uw...@thetaphi.de>.
It's there:

http://lucene.apache.org/java/3_0_2/api/core/org/apache/lucene/search/Query.html#extractTerms(java.util.Set)


-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: starz10de [mailto:farag_ahmed@yahoo.com]
> Sent: Friday, November 05, 2010 8:28 AM
> To: java-user@lucene.apache.org
> Subject: Re: High frequency term for the searched query
> 
> 
> Hi Chris,
>
> I tried your solution and got one problem: "the method
> extractterms(Set<Term>) is undefined for the type Query"
>
> This is the code:
>
> Query query = QueryParser.parse(line, "contents", analyzer);
> // System.out.println("Searching for: " + query.toString("contents"));
>
> Hits hits = searcher.search(query);
>
> final Set<Term> terms = new HashSet<Term>();
>
> query = searcher.rewrite(query);
>
> // the problem is in this line
> query.extractTerms(terms);
>
> for (Term t : terms) {
>     int frequency = searcher.docFreq(t);
> }
> 
> Thanks in advance


Re: High frequency term for the searched query

Posted by starz10de <fa...@yahoo.com>.
Hi Chris,

I tried your solution and got one problem: "the method
extractterms(Set<Term>) is undefined for the type Query"

This is the code:

Query query = QueryParser.parse(line, "contents", analyzer);
// System.out.println("Searching for: " + query.toString("contents"));

Hits hits = searcher.search(query);

final Set<Term> terms = new HashSet<Term>();

query = searcher.rewrite(query);

// the problem is in this line
query.extractTerms(terms);

for (Term t : terms) {
    int frequency = searcher.docFreq(t);
}

Thanks in advance



Re: High frequency term for the searched query

Posted by Chris Lu <ch...@gmail.com>.
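After you get the query object, you can use IndexSearcher's docFreq()
method, like this:

import java.util.HashSet;
import java.util.Set;
import org.apache.lucene.index.Term;
// Assumes an open IndexSearcher `searcher` and a parsed Query `query`.
final Set<Term> terms = new HashSet<Term>();
query = searcher.rewrite(query);  // rewrite into primitive queries first
query.extractTerms(terms);        // collect the terms the query contains
for (Term t : terms) {
    int frequency = searcher.docFreq(t);  // documents containing term t
    System.out.println(t + ": " + frequency);
}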

-- 
--
Chris Lu
-------------------------
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes: http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes




Re: High frequency term for the searched query

Posted by Seth Rosen <se...@architexa.com>.
You might want to take a look at this tutorial on how Lucene calculates
scoring [1]. If all you are interested in is the term frequency and you want
to ignore the other calculations, you can override the other scoring factors
and have them return 1.
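
To make the effect of that suggestion concrete, here is a small standalone
sketch. It is not Lucene code: Factors is a hypothetical interface standing in
for the per-term pieces of a Similarity, and the formula is a simplified
single-term version of the classic score (tf * idf^2 * lengthNorm, with coord,
queryNorm, and boosts left out). The CLASSIC values are only modelled on the
usual defaults.

```java
class TfOnlyScoring {

    /** Hypothetical stand-in for the per-term factors of a Similarity; not a Lucene type. */
    interface Factors {
        double tf(int freq);
        double idf(int docFreq, int numDocs);
        double lengthNorm(int numTerms);
    }

    /** Factors modelled on the classic defaults (an assumption for illustration). */
    static final Factors CLASSIC = new Factors() {
        public double tf(int freq) { return Math.sqrt(freq); }
        public double idf(int docFreq, int numDocs) {
            return Math.log((double) numDocs / (docFreq + 1)) + 1.0;
        }
        public double lengthNorm(int numTerms) { return 1.0 / Math.sqrt(numTerms); }
    };

    /** Same tf, but every other factor is neutralised to 1. */
    static final Factors TF_ONLY = new Factors() {
        public double tf(int freq) { return Math.sqrt(freq); }
        public double idf(int docFreq, int numDocs) { return 1.0; }
        public double lengthNorm(int numTerms) { return 1.0; }
    };

    /** Simplified single-term score: tf * idf^2 * lengthNorm. */
    static double score(Factors f, int freq, int docFreq, int numDocs, int numTerms) {
        double idf = f.idf(docFreq, numDocs);
        return f.tf(freq) * idf * idf * f.lengthNorm(numTerms);
    }

    public static void main(String[] args) {
        // Term occurs 4 times in a 16-term document; 9 of 100 documents contain it.
        System.out.println("classic: " + score(CLASSIC, 4, 9, 100, 16));
        System.out.println("tf only: " + score(TF_ONLY, 4, 9, 100, 16));
    }
}
```

With the other factors fixed at 1, the score depends only on how often the term
occurs in the document, whatever its document frequency or the document length.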

Hope this helps!
Seth Rosen
seth@architexa.com
www.architexa.com



[1] http://www.lucenetutorial.com/advanced-topics/scoring.html
