Posted to java-user@lucene.apache.org by Mufaddal Khumri <mk...@allegromedical.com> on 2006/02/20 17:05:17 UTC

StandardAnalyzer question ...

Hi,

When StandardAnalyzer is used to index documents, aren't the terms, 
amongst other things, lower-cased and stored that way in the index?

I have an index field that I index like this:

....
ramWriter = new IndexWriter(ramDir, standardAnalyzer, true);
....
...
...
doc.add(Field.Text("categoryNames", categoryNames));
...
...

(I periodically write contents from the ram directory to the file system 
directory.)

When I search this field via Luke using the standard analyzer I find 
words like this:
....
Digital Cameras
Digital Camera Batteries
....

Shouldn't the words indexed look like:

....
digital cameras
digital camera batteries
....

If I understand this right, when using the standard analyzer, shouldn't 
the terms be indexed in lower case?

Thanks,


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: get results by relevance, limiting results and then sort the results by some criterion

Posted by Mufaddal Khumri <mk...@allegromedical.com>.

Hi,

That's exactly what I am doing currently. I was just wondering if there 
is a Lucene way to do what I am doing, using QueryFilter etc.

-Thanks.

Dan Armbrust wrote:

> Mufaddal Khumri wrote:
>
>> When I do a search, for example on "batteries", I get 1200+ results. 
>> I would like to show the user, let's say, 300. I can do that by only 
>> extracting the first 300 hits (sorted by decreasing relevance by 
>> default) and displaying those to the user.
>>
>>
>
> If you are only talking about ordering the number of items that you 
> are going to show to the user, that seems to imply that the number 
> will be small.  Why don't you just re-sort the items that you are 
> going to display to the user somewhere in your code after you get the 
> documents back from Lucene?  It may not be quite as clean, but I doubt 
> that there will be any performance impact.
>
> Dan
>



Re: get results by relevance, limiting results and then sort the results by some criterion

Posted by Dan Armbrust <da...@gmail.com>.

Mufaddal Khumri wrote:
> When I do a search, for example on "batteries", I get 1200+ results. I 
> would like to show the user, let's say, 300. I can do that by only 
> extracting the first 300 hits (sorted by decreasing relevance by 
> default) and displaying those to the user.
> 
> 

If you are only talking about ordering the number of items that you are 
going to show to the user, that seems to imply that the number will be 
small.  Why don't you just re-sort the items that you are going to 
display to the user somewhere in your code after you get the documents 
back from Lucene?  It may not be quite as clean, but I doubt that there 
will be any performance impact.
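
As a rough illustration of the re-sort described above, here is a
minimal Java sketch. It assumes the first N hits have already been
copied out of Lucene into plain objects. ScoredProduct and
TopHitsResorter are hypothetical names made up for this example and are
not part of Lucene.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical holder for one search hit; not a Lucene class.
class ScoredProduct {
    final String name;
    final float score;   // relevance score reported by the search
    final double price;  // assumed to come from a stored field

    ScoredProduct(String name, float score, double price) {
        this.name = name;
        this.score = score;
        this.price = price;
    }
}

class TopHitsResorter {
    // Keeps the first `limit` hits (assumed to arrive in relevance order)
    // and returns a copy of them sorted by ascending price.
    static List<ScoredProduct> topByRelevanceThenPrice(
            List<ScoredProduct> hitsByRelevance, int limit) {
        int n = Math.max(0, Math.min(limit, hitsByRelevance.size()));
        List<ScoredProduct> top = new ArrayList<>(hitsByRelevance.subList(0, n));
        // List.sort is stable, so equal prices keep their relevance order.
        top.sort(Comparator.comparingDouble((ScoredProduct p) -> p.price));
        return top;
    }
}
```

With 300 items the sort cost is negligible, and the original
relevance-ordered list is left untouched, so the user can switch back to
"sort by relevance" without another search.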

Dan

-- 
****************************
Daniel Armbrust
Biomedical Informatics
Mayo Clinic Rochester
daniel.armbrust(at)mayo.edu
http://informatics.mayo.edu/


Re: get results by relevance, limiting results and then sort the results by some criterion

Posted by Otis Gospodnetic <ot...@yahoo.com>.

It sounds like this is a webapp.
I'd consider playing with HTML DOM a little bit - come up with a system where I get top N matches by relevance, store them somewhere, and then just re-sort them using users' criteria, without going back to the Lucene index.

For instance, you could store this data inside some JavaScript arrays in the first results page, and re-sort inside the client (browser).  Why go all the way back to the server and disk?

Or, if you really want to go to the server, you could come up with a mechanism where the first set of N hits is stored someplace on disk in whichever format is suitable (e.g. serialized object on disk, XML...) and then when the user wants to re-sort the matches, go to the server, grab the cached data, sort appropriately, and display.  Smells like Ajax, if you want to play with that.

Otis

----- Original Message ----
From: Mufaddal Khumri <mk...@allegromedical.com>
To: java-user@lucene.apache.org
Sent: Tue 21 Feb 2006 11:33:22 AM EST
Subject: get results by relevance, limiting results and then sort the results by some criterion

When I do a search, for example on "batteries", I get 1200+ results. I 
would like to show the user, let's say, 300. I can do that by only 
extracting the first 300 hits (sorted by decreasing relevance by 
default) and displaying those to the user.

Now on the search results page, I have a drop-down box that lets the 
user sort the results by price. When the user selects "Sort by price 
low to high", I would like to be able to sort the same 300 hits I got 
above (sorted by decreasing relevance by default) by price.

Essentially I want to be able to sort the first 300 relevant search 
results by price. (In other words, I would like to be able to get search 
results by relevance, limit the results, and sort the results by some 
criterion.)

What would be a good way to do this in Lucene?

-Thanks.



get results by relevance, limiting results and then sort the results by some criterion

Posted by Mufaddal Khumri <mk...@allegromedical.com>.

When I do a search, for example on "batteries", I get 1200+ results. I 
would like to show the user, let's say, 300. I can do that by only 
extracting the first 300 hits (sorted by decreasing relevance by 
default) and displaying those to the user.

Now on the search results page, I have a drop-down box that lets the 
user sort the results by price. When the user selects "Sort by price 
low to high", I would like to be able to sort the same 300 hits I got 
above (sorted by decreasing relevance by default) by price.

Essentially I want to be able to sort the first 300 relevant search 
results by price. (In other words, I would like to be able to get search 
results by relevance, limit the results, and sort the results by some 
criterion.)

What would be a good way to do this in Lucene?

-Thanks.



Re: exact match ..

Posted by Robert Watkins <rw...@foo-bar.org>.

The way I have solved the problem of allowing exact matches is this: for
each field in which an exact match may be requested, a parallel field is
created at index time that is unstemmed and has a specific prefix:

 	if (fieldData.isSearched() && tokenize && usingStemmingAnalyzer) {
 		doc.add(new Field(UNSTEMMED_FIELD_PREFIX + fieldName,
 			fieldValueStr, false, true, true));
 	}

Also, I use a custom Analyzer for both indexing and searching that
understands this:

 	public TokenStream tokenStream(String fieldName, Reader reader)
 	{
 		TokenStream result = new WISTokenizer(reader);
 		result = new StandardFilter(result);
 		result = new LowerCaseFilter(result);
 		if (stoptable != null) {
 			result = new StopFilter(result, stoptable);
 		}
 		if (!fieldName.startsWith(UNSTEMMED_FIELD_PREFIX)) {
 			result = new SpellFilter(result);
 			result = new PorterStemFilter(result);
 		}
 		return result;
 	}

For searching, I've written a custom parser using JavaCC (I need to
support more operators than Lucene does OOTB), as well as a
QueryBuilder class that constructs the queries "manually" for each node
type. For a quoted string (i.e. requiring an exact match):

 	case JJTQUOTED:
 		if (node.hasWildcard()) {
 			Node phraseNode = SimpleNode.getPhraseNode(node.getName());
 			query = getSpanQuery(new Node[]{phraseNode}, currentField, 0);
 		}
 		else {
 			// match quoted strings "exactly", i.e. without stemming
 			// NB: matches are case insensitive
 			String fieldToSearch = usingStemmingAnalyzer ?
 				UNSTEMMED_FIELD_PREFIX + currentField : currentField;
 			query = getTerminalQuery(node.getName(), fieldToSearch);
 		}
 		break;

and:

 	protected Query getTerminalQuery(String term, String currentField)
 		throws QueryBuildingException
 	{
 		Query q;
 		try {
 			q = org.apache.lucene.queryParser.QueryParser.parse(term,
 				currentField, analyzer);
 		}
 		catch (org.apache.lucene.queryParser.ParseException e) {
 			throw new QueryBuildingException(e);
 		}
 		return q;
 	}

There is, obviously, a fair amount of work involved, but the level of
control is the payoff.

-- Robert

--------------------
Robert Watkins
rwatkins@foo-bar.org
--------------------

On Mon, 20 Feb 2006, Erik Hatcher wrote:

>
> Yes, this is what PerFieldAnalyzerWrapper provides for you, as described in 
> detail in several sections of Lucene in Action:
>
> 	http://www.lucenebook.com/search?query=PerFieldAnalyzerWrapper
>
> Erik
>


Re: exact match ..

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

On Feb 20, 2006, at 1:22 PM, Mufaddal Khumri wrote:
> Just realized that the various fields I have are part of the same  
> document. But in order to leverage the KeywordAnalyzer, I would  
> have to now have two sets of documents.
> One document with the fields: title, content <--- analyzed by  
> custom analyzer
> Other document with the fields: categoryNames < ---- analyzed by  
> keyword analyzer
>
> Is there a way I could have a single document object have some  
> fields analyzed by my custom analyzer and the one field -  
> "categoryNames" analyzed by the keyword analyzer?

Yes, this is what PerFieldAnalyzerWrapper provides for you, as  
described in detail in several sections of Lucene in Action:

	http://www.lucenebook.com/search?query=PerFieldAnalyzerWrapper
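
To make the per-field idea concrete without depending on a particular
Lucene version, here is a plain-Java sketch of the dispatch: a default
analysis for most fields, with an override for "categoryNames". The
class PerFieldDispatchSketch and its methods are hypothetical stand-ins
written for this example; this is not Lucene's PerFieldAnalyzerWrapper,
and the two analysis functions are deliberately simplistic.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Hypothetical illustration of per-field analysis; NOT a Lucene class.
class PerFieldDispatchSketch {
    private final Function<String, List<String>> defaultAnalysis;
    private final Map<String, Function<String, List<String>>> perField =
            new HashMap<>();

    PerFieldDispatchSketch(Function<String, List<String>> defaultAnalysis) {
        this.defaultAnalysis = defaultAnalysis;
    }

    // Registers a field-specific analysis that overrides the default.
    void addAnalysis(String fieldName, Function<String, List<String>> analysis) {
        perField.put(fieldName, analysis);
    }

    // Picks the analysis registered for the field, else the default.
    List<String> analyze(String fieldName, String text) {
        return perField.getOrDefault(fieldName, defaultAnalysis).apply(text);
    }

    // Simplified "tokenizing" analysis: split on whitespace, lower-case.
    static List<String> tokenized(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                out.add(t.toLowerCase(Locale.ROOT));
            }
        }
        return out;
    }

    // Simplified "keyword" analysis: the whole value is one unchanged token.
    static List<String> keyword(String text) {
        return Collections.singletonList(text);
    }

    public static void main(String[] args) {
        PerFieldDispatchSketch analysis =
                new PerFieldDispatchSketch(PerFieldDispatchSketch::tokenized);
        analysis.addAnalysis("categoryNames", PerFieldDispatchSketch::keyword);
        System.out.println(analysis.analyze("title", "Digital Cameras"));
        System.out.println(analysis.analyze("categoryNames", "Digital Cameras"));
    }
}
```

The point of the design is that one document can carry both kinds of
field, provided the same per-field choice is applied at search time.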

Erik



Re: exact match ..

Posted by Mufaddal Khumri <mk...@allegromedical.com>.

Hi,

Just realized that the various fields I have are part of the same 
document. But in order to leverage the KeywordAnalyzer, I would now have 
to have two sets of documents.
One document with the fields: title, content <--- analyzed by custom 
analyzer
Other document with the fields: categoryNames < ---- analyzed by keyword 
analyzer

Is there a way I could have a single document object have some fields 
analyzed by my custom analyzer and the one field - "categoryNames" 
analyzed by the keyword analyzer?

Thanks,

Mufaddal Khumri wrote:

> Hi Steve,
>
> If I understand you right, I could use something like the Keyword 
> analyzer to tokenize the entire stream as a single token and store 
> that in the index. I could definitely use the keyword analyzer while 
> indexing this particular field "categoryNames".
>
> Now my question is how to search and boost this, since this is part 
> of a bigger boolean query in my case.
>
> My typical query actually looks like:
>
> +(+content:digit +content:camera) +entity:product +(title:"digit 
> camera"~2^40.0 ((title:digit title:camera)^10.0) content:"digit 
> camera"~2^20.0 (content:digit content:camera) categoryNames:"digit 
> camera"^80.0)
>
> As you can see, I was trying to do a phrase query on the categoryNames 
> field and boost it by 80.0.
> Also, I am using the Porter stemming filter to stem while searching. (I 
> do this while indexing as well.) If I go with the KeywordAnalyzer 
> approach, I can index the categoryNames field using this analyzer.
>
> Would I be using the QueryParser to create my query and specify the 
> keyword analyzer to it while searching on categoryNames ? (and then 
> make that query part of my global boolean query?)
>
> -Thanks.
>
>
>
>
>
> Steven Rowe wrote:
>
>> Mufaddal Khumri wrote:
>>
>>> Let's say I do this while indexing:
>>>
>>> doc.add(Field.Text("categoryNames", categoryNames));
>>>
>>> Now while searching categoryNames, I do a search for "digital 
>>> cameras". I only want to match documents that have exactly the 
>>> phrase "digital cameras" in the categoryNames field. I do not want 
>>> documents that have "digital camera batteries" as part of the 
>>> result.
>>>
>>> What's the best way to accomplish this?
>>
>>
>>
>> Hi Mufaddal,
>>
>> One way to do this is to use the KeywordAnalyzer (in the Lucene 
>> Subversion trunk, but not in v1.4.3; will be in forthcoming v1.9) for 
>> the "categoryNames" field.  This analyzer does not tokenize field 
>> contents, so "digital cameras" would be a single token, and the only 
>> thing that would match it would be the exact same single token.  Be 
>> careful when you search to construct the search tokens similarly.
>>
>> If you have other fields you want to search, and you want to tokenize 
>> their contents when you index them, you could use the 
>> PerFieldAnalyzerWrapper, so that the KeywordAnalyzer is only used for 
>> the "categoryNames" field.
>>
>> Steve
>>

Re: exact match ..

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

On Feb 20, 2006, at 1:02 PM, Mufaddal Khumri wrote:
> If I understand you right, I could use something like the Keyword  
> analyzer to tokenize the entire stream as a single token and store  
> that in the index. I could definitely use the keyword analyzer while  
> indexing this particular field "categoryNames".

The KeywordAnalyzer is not needed for indexing... simply use  
Field.Keyword() for indexing without analysis.  Beware of case  
sensitivity though.

> +(+content:digit +content:camera) +entity:product +(title:"digit  
> camera"~2^40.0 ((title:digit title:camera)^10.0) content:"digit  
> camera"~2^20.0 (content:digit content:camera) categoryNames:"digit  
> camera"^80.0)
>
> As you can see, I was trying to do a phrase query on the  
> categoryNames field and boost it by 80.0.
> Also, I am using the Porter stemming filter to stem while searching.  
> (I do this while indexing as well.) If I go with the  
> KeywordAnalyzer approach, I can index the categoryNames field using  
> this analyzer.
>
> Would I be using the QueryParser to create my query and specify the  
> keyword analyzer to it while searching on categoryNames ? (and then  
> make that query part of my global boolean query?)

You can use the PerFieldAnalyzerWrapper with the KeywordAnalyzer  
assigned to your categoryNames field, sure, but you wouldn't have  
stemming capability at that point.

	Erik



Re: exact match ..

Posted by Mufaddal Khumri <mk...@allegromedical.com>.

Hi Steve,

If I understand you right, I could use something like the Keyword 
analyzer to tokenize the entire stream as a single token and store that 
in the index. I could definitely use the keyword analyzer while indexing 
this particular field "categoryNames".

Now my question is how to search and boost this, since this is part 
of a bigger boolean query in my case.

My typical query actually looks like:

+(+content:digit +content:camera) +entity:product +(title:"digit 
camera"~2^40.0 ((title:digit title:camera)^10.0) content:"digit 
camera"~2^20.0 (content:digit content:camera) categoryNames:"digit 
camera"^80.0)

As you can see, I was trying to do a phrase query on the categoryNames 
field and boost it by 80.0.
Also, I am using the Porter stemming filter to stem while searching. (I 
do this while indexing as well.) If I go with the KeywordAnalyzer 
approach, I can index the categoryNames field using this analyzer.

Would I be using the QueryParser to create my query and specify the 
keyword analyzer to it while searching on categoryNames ? (and then make 
that query part of my global boolean query?)

-Thanks.

Steven Rowe wrote:

> Mufaddal Khumri wrote:
>
>> Let's say I do this while indexing:
>>
>> doc.add(Field.Text("categoryNames", categoryNames));
>>
>> Now while searching categoryNames, I do a search for "digital 
>> cameras". I only want to match documents that have exactly the 
>> phrase "digital cameras" in the categoryNames field. I do not want 
>> documents that have "digital camera batteries" as part of the 
>> result.
>>
>> What's the best way to accomplish this?
>
>
> Hi Mufaddal,
>
> One way to do this is to use the KeywordAnalyzer (in the Lucene 
> Subversion trunk, but not in v1.4.3; will be in forthcoming v1.9) for 
> the "categoryNames" field.  This analyzer does not tokenize field 
> contents, so "digital cameras" would be a single token, and the only 
> thing that would match it would be the exact same single token.  Be 
> careful when you search to construct the search tokens similarly.
>
> If you have other fields you want to search, and you want to tokenize 
> their contents when you index them, you could use the 
> PerFieldAnalyzerWrapper, so that the KeywordAnalyzer is only used for 
> the "categoryNames" field.
>
> Steve
>

Re: exact match ..

Posted by Steven Rowe <sa...@syr.edu>.

Mufaddal Khumri wrote:
> Let's say I do this while indexing:
> 
> doc.add(Field.Text("categoryNames", categoryNames));
> 
> Now while searching categoryNames, I do a search for "digital cameras". 
> I only want to match documents that have exactly the phrase "digital 
> cameras" in the categoryNames field. I do not want documents that have 
> "digital camera batteries" as part of the result.
> 
> What's the best way to accomplish this?

Hi Mufaddal,

One way to do this is to use the KeywordAnalyzer (in the Lucene 
Subversion trunk, but not in v1.4.3; will be in forthcoming v1.9) for 
the "categoryNames" field.  This analyzer does not tokenize field 
contents, so "digital cameras" would be a single token, and the only 
thing that would match it would be the exact same single token.  Be 
careful when you search to construct the search tokens similarly.
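
The difference can be sketched in plain Java, independent of any Lucene
version. ExactMatchSketch and its methods are hypothetical names for
this illustration only; crudeStem just strips a trailing "s" and is an
assumed stand-in, not the Porter stemmer, and phraseMatches only
approximates what a phrase query does on a tokenized, stemmed field.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration; none of this is Lucene API.
class ExactMatchSketch {
    // Crude stand-in for a stemmer: strips one trailing "s".
    static String crudeStem(String word) {
        return word.length() > 1 && word.endsWith("s")
                ? word.substring(0, word.length() - 1)
                : word;
    }

    // Simplified tokenized-field analysis: lower-case, split, stem.
    static List<String> tokenizeAndStem(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!t.isEmpty()) {
                out.add(crudeStem(t));
            }
        }
        return out;
    }

    // Phrase-style match: the query tokens occur contiguously in the field.
    static boolean phraseMatches(String fieldValue, String query) {
        return Collections.indexOfSubList(
                tokenizeAndStem(fieldValue), tokenizeAndStem(query)) >= 0;
    }

    // Single-token match: the whole field value is one term and must be
    // identical to the query term, including case.
    static boolean keywordMatches(String fieldValue, String query) {
        return fieldValue.equals(query);
    }

    public static void main(String[] args) {
        String query = "digital cameras";
        System.out.println(phraseMatches("digital camera batteries", query));
        System.out.println(keywordMatches("digital camera batteries", query));
        System.out.println(keywordMatches("digital cameras", query));
        System.out.println(keywordMatches("Digital Cameras", query));
    }
}
```

The last call shows why the search term has to be constructed the same
way as the indexed one: with a single untokenized term, a difference in
case alone is enough to miss the match.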

If you have other fields you want to search, and you want to tokenize 
their contents when you index them, you could use the 
PerFieldAnalyzerWrapper, so that the KeywordAnalyzer is only used for 
the "categoryNames" field.

Steve


Re: span first query and boosting ..

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

On Feb 20, 2006, at 12:22 PM, Mufaddal Khumri wrote:

> Hi,
>
> I do this:
>
> SpanFirstQuery fullPhraseInCategoryNamesQuery = new SpanFirstQuery 
> (new SpanTermQuery(new Term("categoryNames", "digital cameras")), 2);
> fullPhraseInCategoryNamesQuery.setBoost(8);
>
> In my log output I get this:
>
> spanFirst(categoryNames:digit camera, 2))
>
> Why can't I boost a span query? What am I doing wrong?

You can boost any Query.  However, the .toString is not showing the  
boost.  Look at IndexSearcher.explain() results to see the effect of  
your boosts in action.

	Erik



span first query and boosting ..

Posted by Mufaddal Khumri <mk...@allegromedical.com>.

Hi,

I do this:

SpanFirstQuery fullPhraseInCategoryNamesQuery = new SpanFirstQuery(new 
SpanTermQuery(new Term("categoryNames", "digital cameras")), 2);
fullPhraseInCategoryNamesQuery.setBoost(8);

In my log output I get this:

spanFirst(categoryNames:digit camera, 2))

Why can't I boost a span query? What am I doing wrong?

-Thanks


exact match ..

Posted by Mufaddal Khumri <mk...@allegromedical.com>.

Let's say I do this while indexing:

doc.add(Field.Text("categoryNames", categoryNames));

Now while searching categoryNames, I do a search for "digital cameras". 
I only want to match documents that have exactly the phrase "digital 
cameras" in the categoryNames field. I do not want documents that have 
"digital camera batteries" as part of the result.

What's the best way to accomplish this?

thanks.


Re: StandardAnalyzer question ...

Posted by Oskar Berger <os...@agent25.se>.

Hello,

I'm not yet an expert in the field, but as I understand it, the terms
are indexed as you specify them (through the analyzer's filters), while
the original contents are stored, unchanged, only if you ask for that
(see Field.UnStored(), which happens to be on its way to being
deprecated).

So maybe you are searching the lower-cased terms but getting the
original cased text back as the result in this very CASE.
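
A small plain-Java sketch of that distinction, under the assumption
that analysis amounts to "split into words and lower-case". The class
IndexedVersusStored and its analyze method are hypothetical and much
simpler than StandardAnalyzer; they only illustrate that the stored
value and the indexed terms are two separate things.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration; not Lucene code.
class IndexedVersusStored {
    // Simplified stand-in for analysis: split on anything that is not a
    // letter or digit, then lower-case each token.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                terms.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        String storedValue = "Digital Camera Batteries";
        // What a result display shows: the stored value, case intact.
        System.out.println("stored value : " + storedValue);
        // What a search is matched against: the analyzed terms.
        System.out.println("indexed terms: " + analyze(storedValue));
    }
}
```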

/oskar 

On Mon, 2006-02-20 at 09:05 -0700, Mufaddal Khumri wrote:
> Hi,
> 
> When StandardAnalyzer is used to index documents, aren't the terms, 
> amongst other things, lower-cased and stored that way in the index?
> 
> I have an index field that I index like this:
> 
> ....
> ramWriter = new IndexWriter(ramDir, standardAnalyzer, true);
> ....
> ...
> ...
> doc.add(Field.Text("categoryNames", categoryNames));
> ...
> ...
> 
> (I periodically write contents from the ram directory to the file system 
> directory.)
> 
> When I search this field via Luke using the standard analyzer I find 
> words like this:
> ....
> Digital Cameras
> Digital Camera Batteries
> ....
> 
> Shouldn't the words indexed look like:
> 
> ....
> digital cameras
> digital camera batteries
> ....
> 
> If I understand this right, when using the standard analyzer, shouldn't 
> the terms be indexed in lower case?
> 
> Thanks,
> 
> 