Posted to java-user@lucene.apache.org by Deb Lucene <de...@gmail.com> on 2012/03/14 16:32:39 UTC

Multi field search with values

Hi Group,

I am working on a Lucene search solution that spans multiple fields. So far,
when the fields are of string type, I have had no difficulty retrieving
documents with MultiFieldQueryParser. For example, my indexing and searching
logic looks like this -

indexing
- I am indexing a corpus on the content of the documents and some keywords
of the documents.

**********************************************
String doc = getText(id);
List<String> keywords = getKeywords(doc);
document.add(new Field("content", doc, Field.Store.NO, Field.Index.ANALYZED,
        Field.TermVector.YES));
for (String keyword : keywords) {
    document.add(new Field("keyword", keyword, Field.Store.NO,
            Field.Index.ANALYZED, Field.TermVector.YES));
}
*********************************************
searching
- I am searching the index using some query text and predefined keywords.
********************************************
String queryText = getQuery();
String keyword = getKeyword();
BooleanClause.Occur[] flags =
        {BooleanClause.Occur.SHOULD, BooleanClause.Occur.SHOULD};
// Field names must match the names used at index time ("content", "keyword").
Query query = MultiFieldQueryParser.parse(Version.LUCENE_33,
        new String[]{queryText, keyword},
        new String[]{"content", "keyword"}, flags, stAnalyzer);
[stAnalyzer is the standard analyzer]

 TopDocs hits = isearcher.search(query, 20);

********************************************

This code is working fine. But now suppose I add one more field (a
"threshold" based on some prior calculation) which is of numeric type.
NumericField field = new NumericField("threshold");
document.add(field.setDoubleValue(threshold));

Can I now search across the "string" fields (i.e. content and keywords)
together with the "double" field (i.e. the threshold)?
I am particularly looking for a query such as -
query - "some content" and "some keywords" and threshold > 0.5.

I surmise I need to use the "numeric field search" technique, but I am not
sure how to add that functionality to MultiFieldQueryParser.

Thanks in advance,
--d

Re: Multi field search with values

Posted by Deb Lucene <de...@gmail.com>.
Hi group,

Is there any way to index a document based on key-value pairs (key = text,
value = double)? For example, we have a situation where -

document 1
IBM - 0.5
Google - 0.9
Apple - 0.3


document 2
IBM - 0.6
Google - 0.1
Apple - 0.4

Now we need to search using two fields, the name (e.g. "IBM", "Apple") and
the score (> 0.5 etc.). A typical search query would be: name == "IBM" &
value > 0.5. Previously we experimented with MFQP and numeric field
queries, but here we need to link the fields.

Thanks in advance.
--d
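
A note on why the fields need to be linked. The sketch below is plain Java, not Lucene code; it only models the matching logic with the values from the example above. If names and values are treated as two independent multi-valued fields, the query name == "IBM" & value > 0.5 also matches document 1, because Google's 0.9 satisfies the range even though IBM's own value is 0.5. Keeping one value per name (comparable to one numeric field per name, here a hypothetical field called "score_IBM") avoids that. The class and method names are assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class LinkedFieldSketch {
    // Builds the name -> score pairs of one document (values from the example above).
    static Map<String, Double> doc(double ibm, double google, double apple) {
        Map<String, Double> d = new LinkedHashMap<String, Double>();
        d.put("IBM", ibm);
        d.put("Google", google);
        d.put("Apple", apple);
        return d;
    }

    // Unlinked model: "name" and "value" behave like two independent
    // multi-valued fields, so any value above the minimum satisfies the range.
    static boolean matchesUnlinked(Map<String, Double> d, String name, double min) {
        boolean anyValueAbove = false;
        for (double v : d.values()) {
            if (v > min) {
                anyValueAbove = true;
            }
        }
        return d.containsKey(name) && anyValueAbove;
    }

    // Linked model: the range is checked only against the value stored under
    // that name, comparable to one numeric field per name
    // (for example a hypothetical field called "score_IBM").
    static boolean matchesLinked(Map<String, Double> d, String name, double min) {
        Double v = d.get(name);
        return v != null && v > min;
    }

    public static void main(String[] args) {
        Map<String, Double> doc1 = doc(0.5, 0.9, 0.3);
        Map<String, Double> doc2 = doc(0.6, 0.1, 0.4);
        System.out.println("doc1 unlinked: " + matchesUnlinked(doc1, "IBM", 0.5));
        System.out.println("doc1 linked:   " + matchesLinked(doc1, "IBM", 0.5));
        System.out.println("doc2 unlinked: " + matchesUnlinked(doc2, "IBM", 0.5));
        System.out.println("doc2 linked:   " + matchesLinked(doc2, "IBM", 0.5));
    }
}
```

In this model only document 2 matches under the linked check, while the unlinked check reports both documents.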

RE: can't find common words -- using Lucene 3.4.0

Posted by Ilya Zavorin <iz...@caci.com>.
Steve,

I had to pull different pieces of the code below from different places in my system, but here is what I do:

		// Analyzer and Version constant used at index time
		Analyzer anIndx = new StandardAnalyzer(Version.LUCENE_34);
		IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_34, anIndx);
		// CREATE overwrites any existing index; APPEND opens an existing one
		iwc.setOpenMode(create ? OpenMode.CREATE : OpenMode.APPEND);
		Directory dir = FSDirectory.open(new File(fPath));
		IndexWriter writer = new IndexWriter(dir, iwc);

Anything suspicious here?

Thanks


Ilya Zavorin


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org




RE: can't find common words -- using Lucene 3.4.0

Posted by Steven A Rowe <sa...@syr.edu>.
On 3/26/2012 at 12:21 PM, Ilya Zavorin wrote:
> I am not seeing anything suspicious. Here's what I see in the HEX:
>
> "n.e" from "pain.electricity": 6E-2E-0D-0A-0D-0A-65 (n-.-CR-LF-CR-LF-e)
> "e.H" from "sentence.He": 65-2E-0D-0A-48

I agree, standard DOS/Windows line endings.

> I am pretty sure I am using the std analyzer

Interesting.  I'm quite sure something else is going on besides StandardAnalyzer, since StandardAnalyzer (more specifically, StandardTokenizer) always breaks tokens on whitespace, and excludes punctuation at the end of tokens.  In case you're interested, the "standard" to which StandardTokenizer (v3.1 - v3.5) conforms is the Word Boundaries rules from Unicode 6.0.0 standard annex #29 aka UAX#29: <http://www.unicode.org/reports/tr29/tr29-17.html#Word_Boundaries>.

Can you share the code where you construct your analyzer and IndexWriterConfig?

> Here's how I add a doc to the index (oc is String containing the whole document):
>
> doc.add(new Field("contents", 
> 		oc, 
> 		Field.Store.YES,
> 		Field.Index.ANALYZED, 
> 		Field.TermVector.WITH_POSITIONS_OFFSETS));
>
> Can this affect the indexing?

The way you add the Field looks fine.

Steve




RE: can't find common words -- using Lucene 3.4.0

Posted by Ilya Zavorin <iz...@caci.com>.
Steve,

I am not seeing anything suspicious. Here's what I see in the HEX:

"n.e" from "pain.electricity": 6E-2E-0D-0A-0D-0A-65 (n-.-CR-LF-CR-LF-e)
"e.H" from "sentence.He": 65-2E-0D-0A-48

I am pretty sure I am using the std analyzer

Here's how I add a doc to the index (oc is String containing the whole document):

doc.add(new Field("contents", 
		oc, 
		Field.Store.YES,
		Field.Index.ANALYZED, 
		Field.TermVector.WITH_POSITIONS_OFFSETS));

Can this affect the indexing?

Thanks,

Ilya









-----Original Message-----
From: Ilya Zavorin [mailto:izavorin@caci.com] 
Sent: Monday, March 26, 2012 10:11 AM
To: java-user@lucene.apache.org
Subject: can't find common words -- using Lucene 3.4.0 

I am writing a Lucene-based indexing and search app and testing it using some simple docs and queries. I have 3 simple docs, shown at the bottom of this email between pairs of "==================="s, and about a dozen terms. One of them is "electricity". As you can see, it appears in all three docs. However, when I search for it, I only get a hit in Doc 2 but not in Doc 1 or Doc 3.

Why is this happening?

Another query term that appears in all three docs but is found in only some is "sentence". I have a bunch of other queries whose terms appear in only one of the three docs, and these are all found correctly.

Is this an indication that I have either set parameters incorrectly when indexing or set up the queries incorrectly (or both)?

Here's how I search:

String qstr = "sentence";
Query query = parser.parse(qstr);
TopDocs results = searcher.search(query, Integer.MAX_VALUE);
ScoreDoc[] hits = results.scoreDocs;

I am using Lucene 3.4.0

Thanks much,

Ilya



Doc 1: 
===================
BALTIMORE - Ricky Williams sits alone.

Ricky Williams is one of 26 running backs to eclipse the 10,000-yard mark in an NFL career.
(US Presswire)
Inside the Baltimore Ravens' locker room the air is alive. Players argue about a bean-bag toss game they play after practices, then mock a teammate who has inexplicably decided to do an interview naked. Music thumps. Giant men laugh, and their laughter rattles off cinder block walls in the symphony of a football team that feels invincible.
Only Ricky Williams sits alone. Here is sentence.
He is huddled on a stool in front of his locker, sweat clothes on, ready to leave. It's a strange image, loaded with contrasts. He doesn't belong here, not with these men, many of whom are almost 10 years younger than him. And yet he feels very much at home. He isn't the star on this team, which is two wins from the Super Bowl. The bulk of the offense is carried by Ray Rice, an effusive bowling ball of a man who in the spirit of running backs relishes the chance to run the ball 25 times a game. Williams is an afterthought, a backup who has carried the ball more than 12 times in only one game this season. Often he might have the ball in his hands on only four or five plays, and this is fine with him. In fact he prefers it. His body has absorbed enough beatings for one lifetime. Let someone else get the pain.

electricity


===================

Doc 2:
===================
Dear Cecil:
This question has gnawed at me since I was a young boy. It is a question posed every day by countless thousands around the globe and yet I have never heard even one remotely legitimate answer. How much wood would a woodchuck chuck if a woodchuck could chuck wood?
- R.F.B., Arlington, Virginia
Cecil replies: Is here sentence?
Are you kidding? Everybody knows a woodchuck would chuck as much wood as a woodchuck could chuck if a woodchuck could chuck wood. Next you'll be wanting to know why she sells seashells by the seashore.

common term is electricity


===================

Doc 3:
===================
CONCORD, N.H. (AP) - For 60 years, New Hampshire has jealously guarded the right to hold the earliest presidential primary, fending off bigger states that claimed that the small New England state was too white to represent the nation's diverse population. Sentence is here.
In its defense, New Hampshire jokingly brags that its voters won't pick a presidential candidate until they've met at least three times face-to-face _ rather than seeing the person in television ads or at large events typical of bigger states. New Hampshire voters expect to shake hands with candidates at coffees that supporters host in their homes or at backyard barbecues.
That tradition paid off in 1976 for a little-known peanut farmer and former Georgia governor. Jimmy Carter won in New Hampshire and went on to become president.

word Hampshire by itself

this state has electricity

This is a state in the United states of America. Here is one term: United America. And Here's another one: States america. And here's yet another == UNITED STATES! Here we are dropping the middle stopword: United States		 America. Finally, we get one word: united. Then the second one: STates. Then the final one: America.

===================




RE: can't find common words -- using Lucene 3.4.0

Posted by Steven A Rowe <sa...@syr.edu>.
Ilya,

StandardAnalyzer treats all forms of newline as whitespace, and doesn't join tokens across whitespace.  Can you look at your original text using a hex editor (or something like it, e.g. Unix "od")?  Check which character is actually in between "electricity" and "this", and "pain." and "electricity" in the original text.

Are you sure that these files were analyzed with StandardAnalyzer, and not some other language-specific analyzer, as a result of language misidentification?

Steve
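
If no hex editor is at hand, the same check can be done from Java. The following is a minimal sketch using only the standard library; the class and method names are made up for illustration. It prints the UTF-8 bytes of a string as dash-separated hex, so a CR-LF pair shows up as 0D-0A.

```java
import java.nio.charset.StandardCharsets;

class HexContext {
    // Renders the UTF-8 bytes of the text as upper-case hex pairs joined by '-'.
    static String hex(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bytes.length; i++) {
            if (i > 0) {
                sb.append('-');
            }
            sb.append(String.format("%02X", bytes[i] & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "n." followed by two Windows line endings and then "e"
        System.out.println(hex("n.\r\n\r\ne"));
    }
}
```

To inspect a real file, the text passed to hex() would be the few characters around the suspicious position, read without any line-ending conversion.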



RE: can't find common words -- using Lucene 3.4.0

Posted by Ilya Zavorin <iz...@caci.com>.
Steve,

Thanks much for the link: very useful!

I looked at the index and found that it contains terms like

electricitythis -- from Doc 3
pain.electricity -- from Doc 1

sentence.he -- from Doc 1

It appears that there is some sort of issue with handling end-of-lines. What do I need to change at index time for this to work properly?


Not sure whether this is relevant, but the text files have been saved as UTF-8 even though they are ASCII. I need to handle foreign text, so I assume all files that I index are UTF-8.

I am using the standard analyzer for English text and other contributed analyzers for the respective foreign texts.


Thanks,

Ilya
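
The thread does not show how the document string is built before it is handed to the analyzer. One possibility, stated here purely as an assumption and not confirmed anywhere in the thread, is that the file is read line by line and the lines are concatenated without putting a separator back. BufferedReader.readLine() strips the line terminator, so text joined that way has no whitespace left between the last word of one line and the first word of the next. The sketch below uses only the standard library and hypothetical method names to show the effect.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

class LineJoinSketch {
    // Concatenates lines exactly as readLine() returns them (terminators dropped).
    static String joinWithoutSeparators(String raw) {
        return join(raw, "");
    }

    // Same loop, but puts a newline back after every line.
    static String joinWithNewlines(String raw) {
        return join(raw, "\n");
    }

    private static String join(String raw, String separator) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new StringReader(raw))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append(separator);
            }
        } catch (IOException e) {
            // Not expected for an in-memory reader; rethrown unchecked.
            throw new IllegalStateException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String raw = "get the pain.\r\n\r\nelectricity";
        System.out.println(joinWithoutSeparators(raw));
        System.out.println(joinWithNewlines(raw));
    }
}
```

With the first variant the sample text becomes "get the pain.electricity", which resembles the "pain.electricity" term seen in the index; whether this is what actually happens in the application would need to be checked against the real file-reading code.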









RE: can't find common words -- using Lucene 3.4.0

Posted by Steven A Rowe <sa...@syr.edu>.
Hi Ilya,

What analyzers are you using at index-time and query-time?

My guess is that you're using an analyzer that includes punctuation in the tokens it emits, in which case your index will have things like "sentence." and "sentence?" in it, so querying for "sentence" will not match.

Luke can tell you what's in your index: <http://code.google.com/p/luke/>

Steve

-----Original Message-----
From: Ilya Zavorin [mailto:izavorin@caci.com] 
Sent: Monday, March 26, 2012 10:11 AM
To: java-user@lucene.apache.org
Subject: can't find common words -- using Lucene 3.4.0 

I am writing a Lucene based indexing-search app and testing it using some simple docs and querries. I have 3 simples docs that are shown at the bottom of the this email between pairs of "==================="s and about a dozen terms. One of them is "electricity". As you can see, it appears in all three docs. However, when I search for it, I only get a hit in Doc 2 but not in Doc 1 or Doc 3. 

Why is this happening? 

Another query that appears in all three but found in only some is "sentence". I have a bunch of other querries that only appear in one of the three docs and these are all found correctly. 

Is this an indication that I have either set parameers incorrectly when indexing or set up the quesrries incorrectly (or both)? 

Here's how I search:

String qstr = "sentence";
Query query = parser.parse(qstr);
TopDocs results = searcher.search(query, Integer.MAX_VALUE); ScoreDoc[] hits = results.scoreDocs;

I am using Lucene 3.4.0

Thanks much,

Ilya



Doc 1: 
===================
BALTIMORE - Ricky Williams sits alone.

Ricky Williams is one of 26 running backs to eclipse the 10,000-yard mark in an NFL career.
(US Presswire)
Inside the Baltimore Ravens' locker room the air is alive. Players argue about a bean-bag toss game they play after practices, then mock a teammate who has inexplicably decided to do an interview naked. Music thumps. Giant men laugh, and their laughter rattles off cinder block walls in the symphony of a football team that feels invincible.
Only Ricky Williams sits alone. Here is sentence.
He is huddled on a stool in front of his locker, sweat clothes on, ready to leave. It's a strange image, loaded with contrasts. He doesn't belong here, not with these men, many of whom are almost 10 years younger than him. And yet he feels very much at home. He isn't the star on this team, which is two wins from the Super Bowl. The bulk of the offense is carried by Ray Rice, an effusive bowling ball of a man who in the spirit of running backs relishes the chance to run the ball 25 times a game. Williams is an afterthought, a backup who has carried the ball more than 12 times in only one game this season. Often he might have the ball in his hands on only four or five plays, and this is fine with him. In fact he prefers it. His body has absorbed enough beatings for one lifetime. Let someone else get the pain.

electricity


===================

Doc 2:
===================
Dear Cecil:
This question has gnawed at me since I was a young boy. It is a question posed every day by countless thousands around the globe and yet I have never heard even one remotely legitimate answer. How much wood would a woodchuck chuck if a woodchuck could chuck wood?
- R.F.B., Arlington, Virginia
Cecil replies: Is here sentence?
Are you kidding? Everybody knows a woodchuck would chuck as much wood as a woodchuck could chuck if a woodchuck could chuck wood. Next you'll be wanting to know why she sells seashells by the seashore.

common term is electricity


===================

Doc 3:
===================
CONCORD, N.H. (AP) - For 60 years, New Hampshire has jealously guarded the right to hold the earliest presidential primary, fending off bigger states that claimed that the small New England state was too white to represent the nation's diverse population. Sentence is here.
In its defense, New Hampshire jokingly brags that its voters won't pick a presidential candidate until they've met at least three times face-to-face _ rather than seeing the person in television ads or at large events typical of bigger states. New Hampshire voters expect to shake hands with candidates at coffees that supporters host in their homes or at backyard barbecues.
That tradition paid off in 1976 for a little-known peanut farmer and former Georgia governor. Jimmy Carter won in New Hampshire and went on to become president.

word Hampshire by itself

this state has electricity

This is a state in the United states of America. Here is one term: United America. And Here's another one: States america. And here's yet another == UNITED STATES! Here we are dropping the middle stopword: United States		 America. Finally, we get one word: united. Then the second one: STates. Then the final one: America.

===================


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


can't find common words -- using Lucene 3.4.0

Posted by Ilya Zavorin <iz...@caci.com>.
I am writing a Lucene-based indexing-search app and testing it using some simple docs and queries. I have 3 simple docs that are shown at the bottom of this email between pairs of "==================="s and about a dozen terms. One of them is "electricity". As you can see, it appears in all three docs. However, when I search for it, I only get a hit in Doc 2 but not in Doc 1 or Doc 3.

Why is this happening? 

Another query term that appears in all three docs but is found in only some is "sentence". I have a bunch of other queries that only appear in one of the three docs, and these are all found correctly.

Is this an indication that I have either set parameters incorrectly when indexing or set up the queries incorrectly (or both)?

Here's how I search:

String qstr = "sentence";
Query query = parser.parse(qstr);
TopDocs results = searcher.search(query, Integer.MAX_VALUE);
ScoreDoc[] hits = results.scoreDocs;

I am using Lucene 3.4.0

Thanks much,

Ilya



Doc 1: 
===================
BALTIMORE - Ricky Williams sits alone.

Ricky Williams is one of 26 running backs to eclipse the 10,000-yard mark in an NFL career.
(US Presswire) 
Inside the Baltimore Ravens' locker room the air is alive. Players argue about a bean-bag toss game they play after practices, then mock a teammate who has inexplicably decided to do an interview naked. Music thumps. Giant men laugh, and their laughter rattles off cinder block walls in the symphony of a football team that feels invincible.
Only Ricky Williams sits alone. Here is sentence.
He is huddled on a stool in front of his locker, sweat clothes on, ready to leave. It's a strange image, loaded with contrasts. He doesn't belong here, not with these men, many of whom are almost 10 years younger than him. And yet he feels very much at home. He isn't the star on this team, which is two wins from the Super Bowl. The bulk of the offense is carried by Ray Rice, an effusive bowling ball of a man who in the spirit of running backs relishes the chance to run the ball 25 times a game. Williams is an afterthought, a backup who has carried the ball more than 12 times in only one game this season. Often he might have the ball in his hands on only four or five plays, and this is fine with him. In fact he prefers it. His body has absorbed enough beatings for one lifetime. Let someone else get the pain.

electricity


===================

Doc 2:
===================
Dear Cecil:
This question has gnawed at me since I was a young boy. It is a question posed every day by countless thousands around the globe and yet I have never heard even one remotely legitimate answer. How much wood would a woodchuck chuck if a woodchuck could chuck wood?
- R.F.B., Arlington, Virginia
Cecil replies: Is here sentence?
Are you kidding? Everybody knows a woodchuck would chuck as much wood as a woodchuck could chuck if a woodchuck could chuck wood. Next you'll be wanting to know why she sells seashells by the seashore.

common term is electricity


===================

Doc 3:
===================
CONCORD, N.H. (AP) - For 60 years, New Hampshire has jealously guarded the right to hold the earliest presidential primary, fending off bigger states that claimed that the small New England state was too white to represent the nation's diverse population. Sentence is here.
In its defense, New Hampshire jokingly brags that its voters won't pick a presidential candidate until they've met at least three times face-to-face _ rather than seeing the person in television ads or at large events typical of bigger states. New Hampshire voters expect to shake hands with candidates at coffees that supporters host in their homes or at backyard barbecues.
That tradition paid off in 1976 for a little-known peanut farmer and former Georgia governor. Jimmy Carter won in New Hampshire and went on to become president.

word Hampshire by itself

this state has electricity

This is a state in the United states of America. Here is one term: United America. And Here's another one: States america. And here's yet another == UNITED STATES! Here we are dropping the middle stopword: United States		 America. Finally, we get one word: united. Then the second one: STates. Then the final one: America.

===================


Re: Multi field search with values

Posted by Deb Lucene <de...@gmail.com>.
Hi Ian,

thanks a lot for your idea. Yes, it is working now.
thanks again

--d

On Wed, Mar 14, 2012 at 11:52 AM, Ian Lea <ia...@gmail.com> wrote:

> If the keywords are already in the document body (field "content") I
> don't see what you gain by indexing them separately and using MFQP.
> But that isn't what you are asking.  To add a threshold to the query
> do something like this:
>
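> BooleanQuery bq = new BooleanQuery();
> Query qm = query; // the existing MultiFieldQueryParser query, built as now
> bq.add(qm, BooleanClause.Occur.MUST);
> // threshold > 0.5: lower bound exclusive, no upper bound
> Query qthresh = NumericRangeQuery.newDoubleRange("threshold", 0.5, null, false, true);
> bq.add(qthresh, BooleanClause.Occur.MUST);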
>
> and use bq in the search call.
>
>
> --
> Ian.
>
>
> On Wed, Mar 14, 2012 at 3:32 PM, Deb Lucene <de...@gmail.com> wrote:
> > Hi Group,
> >
> > I am working on a Lucene search solution for multiple fields. So far, if
> > the fields are of string type I am having no difficulties in retrieving
> > using the MultiFieldQueryParser. For example, my indexing and searching
> > logic look like this -
> >
> > indexing
> > - I am indexing a corpus on the content of the documents and some keywords
> > of the documents.
> >
> > **********************************************
> > String doc = getText(id) ;
> > List<String> keywords = getKeywords(doc);
> >  document.add(new Field("content", doc, Field.Store.NO, Field.Index.ANALYZED,
> > Field.TermVector.YES));
> > for ( String keyword : keywords )
> >  {
> >    document.add(new Field("keyword", keyword, Field.Store.NO,
> > Field.Index.ANALYZED, Field.TermVector.YES));
> >  }
> > *********************************************
> > I am searching over the indexes using some query text and predefined
> > keywords
> > searching :
> > ********************************************
> > String queryText = getQuery();
> > String keyword = getKeyword();
> >  BooleanClause.Occur[] flags =
> > {BooleanClause.Occur.SHOULD,BooleanClause.Occur.SHOULD};
> >  Query query = MultiFieldQueryParser.parse(Version.LUCENE_33, new String[]
> > {queryText, keyword},
> >                 new String[]{"content","keywords"}, flags, stAnalyzer);
> > [stAnalyzer is the standard analyzer]
> >
> >  TopDocs hits = isearcher.search(query, 20);
> >
> > ********************************************
> >
> > This code is working fine. But now suppose I add one more field (a
> > "threshold" set on some prior calculation) which is of numeric type.
> > NumericField field = new NumericField("threshold") ;
> > document.add(field.setDoubleValue(threhold));
> >
> > Now can I search over multiple fields using the "string" type (i.e. content
> > and keywords) with the "double" type (i.e. the threshold)?
> > I am particularly looking for a query such as -
> > query - "some content" and "some keywords" and threshold > 0.5.
> >
> > I surmise I need to use the "numeric field search" technique but not sure
> > how to add the functionality in MultiFieldQueryParser.
> >
> > Thanks in advance,
> > --d

Re: Multi field search with values

Posted by Ian Lea <ia...@gmail.com>.
If the keywords are already in the document body (field "content") I
don't see what you gain by indexing them separately and using MFQP.
But that isn't what you are asking.  To add a threshold to the query
do something like this:

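BooleanQuery bq = new BooleanQuery();
Query qm = query; // the existing MultiFieldQueryParser query, built as now
bq.add(qm, BooleanClause.Occur.MUST);
// threshold > 0.5: lower bound exclusive, no upper bound
Query qthresh = NumericRangeQuery.newDoubleRange("threshold", 0.5, null, false, true);
bq.add(qthresh, BooleanClause.Occur.MUST);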

and use bq in the search call.
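
To make the intended semantics of "text match and threshold > 0.5" concrete, here is a rough plain-Java model of what such a combined query should select. It is only a sketch and uses no Lucene classes: the names ThresholdRangeSketch, Doc, inRange and search are made up for illustration, and the substring check is a crude stand-in for analyzed term matching. The inRange parameters are meant to mirror the (min, max, minInclusive, maxInclusive) arguments of Lucene 3.x's NumericRangeQuery.newDoubleRange, where a null bound leaves that side open.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model only; no Lucene classes are used here.
final class ThresholdRangeSketch {

    // Hypothetical stand-in for an indexed document.
    static final class Doc {
        final String content;
        final double threshold;

        Doc(String content, double threshold) {
            this.content = content;
            this.threshold = threshold;
        }
    }

    // Range test; a null bound means "open-ended" on that side.
    static boolean inRange(double value, Double min, Double max,
                           boolean minInclusive, boolean maxInclusive) {
        if (min != null && (minInclusive ? value < min : value <= min)) {
            return false;
        }
        if (max != null && (maxInclusive ? value > max : value >= max)) {
            return false;
        }
        return true;
    }

    // Both clauses are required (logical AND): the text must match
    // and the threshold must be strictly greater than min.
    static List<Doc> search(List<Doc> docs, String term, Double min) {
        List<Doc> hits = new ArrayList<Doc>();
        for (Doc d : docs) {
            boolean textMatch = d.content.toLowerCase().contains(term.toLowerCase());
            boolean rangeMatch = inRange(d.threshold, min, null, false, true);
            if (textMatch && rangeMatch) {
                hits.add(d);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Doc> docs = new ArrayList<Doc>();
        docs.add(new Doc("some content", 0.5)); // excluded: 0.5 is not > 0.5
        docs.add(new Doc("some content", 0.7)); // included
        docs.add(new Doc("other text", 0.9));   // excluded: text does not match
        System.out.println(search(docs, "content", 0.5).size()); // prints 1
    }
}
```

The point of the model is the boundary case: with an exclusive lower bound a document whose threshold is exactly 0.5 is dropped, so pass minInclusive = true instead if "threshold >= 0.5" is what you want.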


--
Ian.


On Wed, Mar 14, 2012 at 3:32 PM, Deb Lucene <de...@gmail.com> wrote:
> Hi Group,
>
> I am working on a Lucene search solution for multiple fields. So far, if
> the fields are of string type I am having no difficulties in retrieving
> using the MultiFieldQueryParser. For example, my indexing and searching
> logic look like this -
>
> indexing
> - I am indexing a corpus on the content of the documents and some keywords
> of the documents.
>
> **********************************************
> String doc = getText(id) ;
> List<String> keywords = getKeywords(doc);
>  document.add(new Field("content", doc, Field.Store.NO,Field.Index.ANALYZED,
> Field.TermVector.YES));
> for ( String keyword : keywords )
>  {
>    document.add(new Field("keyword", keyword, Field.Store.NO,
> Field.Index.ANALYZED, Field.TermVector.YES));
>  }
> *********************************************
> I am searching over the indexes using some query text and predefined
> keywords
> searching :
> ********************************************
> String queryText = getQuery();
> String keyword = getKeyword();
>  BooleanClause.Occur[] flags =
> {BooleanClause.Occur.SHOULD,BooleanClause.Occur.SHOULD};
>  Query query = MultiFieldQueryParser.parse(Version.LUCENE_33, new String[]
> {queryText, keyword},
>                 new String[]{"content","keywords"}, flags, stAnalyzer);
> [stAnalyzer is the standard analyzer]
>
>  TopDocs hits = isearcher.search(query, 20);
>
> ********************************************
>
> This code is working fine. But now suppose I add one more field (a
> "threshold" set on some prior calculation) which is of numeric type.
> NumericField field = new NumericField("threshold") ;
> document.add(field.setDoubleValue(threhold));
>
> Now can I search over multiple fields using the "string" type (i.e. content
> and keywords) with the "double" type (i.e. the threshold)?
> I am particularly looking for a query such as -
> query - "some content" and "some keywords" and threshold > 0.5.
>
> I surmise I need to use the "numeric field search" technique but not sure
> how to add the functionality in MultiFieldQueryParser.
>
> Thanks in advance,
> --d
