Posted to java-user@lucene.apache.org by "deb.lucene" <de...@gmail.com> on 2011/10/28 18:13:16 UTC

multiple phrase search for topic

Hi Group,

I am indexing and searching a large corpus of news articles. The indexing
process is straightforward: I use StandardAnalyzer to analyze the content
of each news document.
**************************
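document = new Document();
// "snum" is stored for retrieval only; it is not indexed, so it is not searchable.
document.add(new Field("snum", snum, Field.Store.YES, Field.Index.NO));
// "content" is analyzed and indexed with term vectors, but not stored.
document.add(new Field("content", content,
        Field.Store.NO, Field.Index.ANALYZED, Field.TermVector.YES));
indexWriter.addDocument(document);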

where "snum" is the serial number of the news article and "content" is the
actual text of the document.

******************************
So far so good. The search side is a little more complex, because I am
searching for many phrases at once. Let me explain with an example.
Suppose I have to retrieve documents that belong to the category "Software
Technology" using phrases related to that topic. I have around 10k phrases
for this category (e.g. "data recovery tool", ..., "C++ language", ...,
"Steve Jobs", ..., "Mac Layer", ..., "Grid Computing", etc.). My idea was
to create a separate phrase query for each of these phrases and then add
all of them to a boolean query, like this:

****************************
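import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.TopDocs;

// setMaxClauseCount is static; the default limit is 1024 clauses.
BooleanQuery.setMaxClauseCount(10000);
BooleanQuery bQuery = new BooleanQuery();

// allPhrases is assumed here to be a collection of plain phrase Strings.
for (String phrase : allPhrases)
{
    // NOTE: terms are split on whitespace and lowercased by hand; they are
    // not run through the analyzer that was used at index time.
    String[] terms = phrase.trim().split("\\s+");

    PhraseQuery pQuery = new PhraseQuery();
    for (String term : terms)
    {
        pQuery.add(new Term("content", term.toLowerCase()));
    }
    pQuery.setSlop(0); // exact phrase: terms adjacent and in order
    bQuery.add(pQuery, BooleanClause.Occur.SHOULD);
}

int numOfSugg = 2000; // maximum number of hits to return
TopDocs matches = isearcher.search(bQuery, numOfSugg);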

********************************
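
One thing worth checking with the loop above: the index was built with StandardAnalyzer, but the query terms come from a whitespace split plus toLowerCase(). The two can disagree on punctuation, so a query term such as "c++" may never match an indexed token. The sketch below is plain Java and only a rough stand-in for an analyzer (it is NOT StandardAnalyzer, and the class and method names are made up for illustration); it shows the kind of mismatch to look for.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class PhraseTerms {
    // Whitespace split plus lowercasing, as in the query-building loop.
    static List<String> whitespaceTerms(String phrase) {
        List<String> out = new ArrayList<String>();
        for (String t : phrase.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                out.add(t.toLowerCase(Locale.ROOT));
            }
        }
        return out;
    }

    // Rough stand-in for an analyzer (an assumption, not Lucene code):
    // lowercase, then split on anything that is not a letter or a digit.
    static List<String> roughAnalyzedTerms(String phrase) {
        List<String> out = new ArrayList<String>();
        for (String t : phrase.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(whitespaceTerms("C++ language"));
        System.out.println(roughAnalyzedTerms("C++ language"));
    }
}
```

If the two lists differ for some of the 10k phrases, building the PhraseQuery terms by running each phrase through the same analyzer that built the index is one way to keep them consistent.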
Unfortunately, when I search the news content with this approach, the
results do not look very promising. Many of the top-ranked documents are
not the best candidates for the "Software Technology" topic, even though
they contain some of the phrases (although not very frequently). My
questions are:

1) Is there anything wrong with this usage of the phrase/boolean queries?
2) How can I make sure that the most suitable news documents (i.e.
documents that contain many of the related phrases) appear at the top of
the results? I used BooleanClause.Occur.SHOULD (instead of MUST) because
it is impossible to find a single document containing all of the 10k
phrases; with SHOULD, I expect the best results to be the documents that
contain at least a few of the phrases.
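
For question 2, one way to make "contains many of the related phrases" concrete is to count the distinct phrases found in each document and rank on that count, independent of how Lucene scores the hits. A minimal sketch in plain Java, assuming the documents are available as id/text pairs and using naive case-insensitive substring matching; PhraseCoverage and its methods are hypothetical names, not Lucene API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class PhraseCoverage {
    // Number of distinct phrases that occur at least once in the text.
    static int distinctMatches(String text, List<String> phrases) {
        String haystack = text.toLowerCase(Locale.ROOT);
        int count = 0;
        for (String p : phrases) {
            if (haystack.contains(p.toLowerCase(Locale.ROOT))) {
                count++;
            }
        }
        return count;
    }

    // Ids of documents with at least minMatches distinct phrases,
    // ordered by descending count (ties keep their input order).
    static List<String> rank(Map<String, String> docs, List<String> phrases, int minMatches) {
        final Map<String, Integer> hits = new LinkedHashMap<String, Integer>();
        for (Map.Entry<String, String> e : docs.entrySet()) {
            int n = distinctMatches(e.getValue(), phrases);
            if (n >= minMatches) {
                hits.put(e.getKey(), n);
            }
        }
        List<String> ids = new ArrayList<String>(hits.keySet());
        Collections.sort(ids, new Comparator<String>() {
            public int compare(String a, String b) {
                return hits.get(b) - hits.get(a);
            }
        });
        return ids;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<String, String>();
        docs.put("1", "Steve Jobs, Steve Jobs and Steve Jobs again.");
        docs.put("2", "A data recovery tool built on grid computing.");
        List<String> phrases = new ArrayList<String>();
        phrases.add("Steve Jobs");
        phrases.add("data recovery tool");
        phrases.add("Grid Computing");
        System.out.println(rank(docs, phrases, 1));
    }
}
```

Within Lucene itself, BooleanQuery.setMinimumNumberShouldMatch(int) may give a similar cut-off on how many SHOULD clauses must match; it is worth checking against the version in use.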

thanks in advance,
--d


--
View this message in context: http://lucene.472066.n3.nabble.com/multiple-phrase-search-for-topic-tp3461423p3461423.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: multiple phrase search for topic

Posted by "deb.lucene" <de...@gmail.com>.
Hi Ian,

The other question I had was about the quality of the results (especially
in the top ranks). But then I used the "explain" functionality of Lucene
and observed how the tf/idf parameters contribute to the scores.

I would be interested in seeing any work that modifies the "similarity"
function in Lucene.
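
For reference when reading explain() output, here is a simplified sketch of how the classic tf/idf formula combines per-clause scores. It uses the documented DefaultSimilarity defaults (tf = sqrt(freq), idf = 1 + ln(numDocs / (docFreq + 1)), coord = matched clauses / total clauses) and deliberately ignores queryNorm, boosts and length normalization, so the numbers are illustrative only. The class name and the per-clause idf values in main are made up.

```java
final class ClassicScoreSketch {
    // Term-frequency factor: grows with the square root of the frequency.
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    // Inverse document frequency: rarer clauses get a larger weight.
    static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    // freqs[i] = occurrences of clause i in the document (0 = no match),
    // idfs[i]  = idf weight of clause i. Simplified: no norms, no boosts.
    static double score(int[] freqs, double[] idfs) {
        double sum = 0.0;
        int matched = 0;
        for (int i = 0; i < freqs.length; i++) {
            if (freqs[i] > 0) {
                sum += tf(freqs[i]) * idfs[i] * idfs[i];
                matched++;
            }
        }
        // coord factor: fraction of the query clauses that matched
        return sum * matched / freqs.length;
    }

    public static void main(String[] args) {
        double[] idfs = {8.0, 3.0, 3.0, 3.0};      // one rare phrase, three common ones
        int[] oneRarePhraseOften = {9, 0, 0, 0};
        int[] threeCommonPhrasesOnce = {0, 1, 1, 1};
        System.out.println(score(oneRarePhraseOften, idfs));
        System.out.println(score(threeCommonPhrasesOnce, idfs));
    }
}
```

With these made-up weights, the document that repeats a single rare phrase scores 48.0, while the document that contains three different common phrases scores 20.25. That is one way a document matching few phrases can end up ranked above one matching many.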

Regards,
d



Re: multiple phrase search for topic

Posted by Ian Lea <ia...@gmail.com>.
Nice not to have to worry about performance.  You say there is another
question, but not what it is.  The code you show looks like it should
do what you want.

For anything non-trivial I prefer to build the queries directly in
code rather than concatenating strings to be parsed, because I find it
hard to work out the quotes and brackets and what the result will be.
But your way is fine.


--
Ian.


On Mon, Oct 31, 2011 at 2:51 PM, deb.lucene <de...@gmail.com> wrote:
> Thanks, Ian, for your response. This is a one-time offline program, so I am
> not concerned about performance (i.e. speed etc.).
>
> One more question: there are some situations where I need an AND clause
> (i.e. more than one phrase, such as "Apple" AND "Steve Jobs"). My approach
> was something like this:
>
> **********************
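> // Assumes phrase1 and phrase2 already carry their own double quotes
> // (e.g. "\"Steve Jobs\""); unquoted words are parsed as separate terms.
> String searchString = "(" + phrase1 + ") AND (" + phrase2 + ")";
> QueryParser queryParser = new QueryParser(Version.LUCENE_33, "content",
>         new StandardAnalyzer(Version.LUCENE_33));
>
> Query query = queryParser.parse(searchString); // may throw ParseException
> bQuery.add(query, BooleanClause.Occur.SHOULD);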
>
> **********************
> thanks for the carrot2 pointer.
>
> -d


Re: multiple phrase search for topic

Posted by "deb.lucene" <de...@gmail.com>.
Thanks, Ian, for your response. This is a one-time offline program, so I am
not concerned about performance (i.e. speed etc.).

One more question: there are some situations where I need an AND clause
(i.e. more than one phrase, such as "Apple" AND "Steve Jobs"). My approach
was something like this:

**********************
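// Assumes phrase1 and phrase2 already carry their own double quotes
// (e.g. "\"Steve Jobs\""); unquoted words are parsed as separate terms.
String searchString = "(" + phrase1 + ") AND (" + phrase2 + ")";
QueryParser queryParser = new QueryParser(Version.LUCENE_33, "content",
        new StandardAnalyzer(Version.LUCENE_33));

Query query = queryParser.parse(searchString); // may throw ParseException
bQuery.add(query, BooleanClause.Occur.SHOULD);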

**********************
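
A note on the string approach above: in the classic query parser syntax, a multi-word phrase has to be wrapped in double quotes to be parsed as a phrase; words that are only grouped in parentheses are treated as separate terms. The helper below is a small sketch in plain Java of building such a string safely. QueryStrings, quote and allOf are hypothetical names, and the escaping shown (backslash before an embedded backslash or quote) is an assumption about what the phrases may contain.

```java
final class QueryStrings {
    // Wraps a phrase in double quotes, escaping backslashes and embedded quotes.
    static String quote(String phrase) {
        String escaped = phrase.replace("\\", "\\\\").replace("\"", "\\\"");
        return "\"" + escaped + "\"";
    }

    // Joins the quoted phrases with AND so that every phrase is required.
    static String allOf(String... phrases) {
        StringBuilder sb = new StringBuilder();
        for (String p : phrases) {
            if (sb.length() > 0) {
                sb.append(" AND ");
            }
            sb.append(quote(p));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints "Apple" AND "Steve Jobs"
        System.out.println(allOf("Apple", "Steve Jobs"));
    }
}
```

The resulting string can then be handed to queryParser.parse(...) as in the snippet above.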
thanks for the carrot2 pointer.

-d






Re: multiple phrase search for topic

Posted by Ian Lea <ia...@gmail.com>.
Seems to me your approach should work, although I'd worry about performance.

> A lot of top-ranked documents are not the best candidates for the "Software Technology" topic, even
> though they contain the phrases (not very frequent)

Surely the docs that contain the phrases are going to be top of the
list?  In what way are others "better" than the ones ranked top?

Running queries with a large number of clauses on large indexes can be
slow.  I'd look into doing the categorisation at indexing time then
searching with a simple "category: Software Technology" clause.  Or
filter.
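
The index-time categorisation suggested above could look roughly like the following. This is a sketch in plain Java under stated assumptions: the phrase lists per category are available in memory, matching is naive case-insensitive substring search, and the threshold minMatches is a made-up tuning knob. IndexTimeCategories is a hypothetical name, not part of Lucene.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class IndexTimeCategories {
    // Returns every category for which the text contains at least
    // minMatches distinct phrases from that category's phrase list.
    static Set<String> categoriesFor(String text,
                                     Map<String, List<String>> phrasesByCategory,
                                     int minMatches) {
        String haystack = text.toLowerCase(Locale.ROOT);
        Set<String> result = new TreeSet<String>();
        for (Map.Entry<String, List<String>> e : phrasesByCategory.entrySet()) {
            int hits = 0;
            for (String p : e.getValue()) {
                if (haystack.contains(p.toLowerCase(Locale.ROOT))) {
                    hits++;
                }
            }
            if (hits >= minMatches) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<String>> phrases = new java.util.HashMap<String, List<String>>();
        phrases.put("Software Technology",
                java.util.Arrays.asList("grid computing", "data recovery tool"));
        System.out.println(categoriesFor(
                "A new data recovery tool built on Grid Computing.", phrases, 2));
    }
}
```

Each returned label could then be added to the Document as a non-analyzed "category" field at indexing time (Field.Index.NOT_ANALYZED in 3.x), so that searching becomes a single TermQuery on that field instead of a 10k-clause BooleanQuery.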

Projects such as Carrot2 or LingPipe may be worth a look.


--
Ian.


