Posted to java-user@lucene.apache.org by Mufaddal Khumri <mk...@allegromedical.com> on 2006/02/20 17:05:17 UTC
StandardAnalyzer question ...
Hi,
When StandardAnalyzer is used to index documents, aren't the terms,
amongst other things, lower-cased and stored that way in the index?
I have an index field that I index like this:
....
ramWriter = new IndexWriter(ramDir, standardAnalyzer, true);
....
...
...
doc.add(Field.Text("categoryNames", categoryNames));
...
...
(I periodically write contents from the ram directory to the file system
directory.)
When I search this field via Luke using the standard analyzer, I find
words like this:
....
Digital Cameras
Digital Camera Batteries
....
Shouldn't the words indexed look like:
....
digital cameras
digital camera batteries
....
If I understand this right, when using standard analyzer, shouldn't the
terms be indexed in lower case?
Thanks,
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: get results by relevance, limiting results and then sort the
results by some criterion
Posted by Mufaddal Khumri <mk...@allegromedical.com>.
Hi,
That's exactly what I am doing currently. I was just wondering if there is
a Lucene way to do what I am doing, using QueryFilter etc.
-Thanks.
Dan Armbrust wrote:
> Mufaddal Khumri wrote:
>
>> When I do a search for example on "batteries" i get 1200+ results. I
>> would like to show the user lets say 300. I can do that by only
>> extracting the first 300 hits (sorted by decreasing relevance by
>> default) and displaying those to the user.
>>
>>
>
> If you are only talking about ordering the number of items that you
> are going to show to the user, that seems to imply that the number
> will be small. Why don't you just re-sort the items that you are
> going to display to the user somewhere in your code after you get the
> documents back from lucene? It may not be quite as clean, but I doubt
> that there will be any performance impact.
>
> Dan
>
Re: get results by relevance, limiting results and then sort the
results by some criterion
Posted by Dan Armbrust <da...@gmail.com>.
Mufaddal Khumri wrote:
> When I do a search for example on "batteries" i get 1200+ results. I
> would like to show the user lets say 300. I can do that by only
> extracting the first 300 hits (sorted by decreasing relevance by
> default) and displaying those to the user.
>
>
If you are only talking about ordering the number of items that you are
going to show to the user, that seems to imply that the number will be
small. Why don't you just re-sort the items that you are going to
display to the user somewhere in your code after you get the documents
back from lucene? It may not be quite as clean, but I doubt that there
will be any performance impact.
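A minimal sketch of that re-sort, assuming the hits have already been copied out of the search results into plain objects. ProductHit and TopHitsSorter are hypothetical names used only for illustration (they are not Lucene classes), and the price is assumed to come from a stored field:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical value object: one search hit copied out of the search results.
final class ProductHit {
    final String name;
    final float score;   // relevance score reported by the search
    final double price;  // assumed to be read from a stored "price" field

    ProductHit(String name, float score, double price) {
        this.name = name;
        this.score = score;
        this.price = price;
    }
}

final class TopHitsSorter {
    // Keep the first `limit` hits (the input is assumed to be in relevance
    // order), then return a copy of that slice sorted by ascending price.
    static List<ProductHit> topByRelevanceThenPrice(List<ProductHit> byRelevance, int limit) {
        int n = Math.min(limit, byRelevance.size());
        List<ProductHit> top = new ArrayList<>(byRelevance.subList(0, n));
        top.sort(Comparator.comparingDouble((ProductHit h) -> h.price));
        return top;
    }
}
```

Calling TopHitsSorter.topByRelevanceThenPrice(hits, 300) leaves the original relevance-ordered list untouched, so the default ordering can still be shown again later.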
Dan
--
****************************
Daniel Armbrust
Biomedical Informatics
Mayo Clinic Rochester
daniel.armbrust(at)mayo.edu
http://informatics.mayo.edu/
Re: get results by relevance, limiting results and then sort the results by some criterion
Posted by Otis Gospodnetic <ot...@yahoo.com>.
It sounds like this is a webapp.
I'd consider playing with HTML DOM a little bit - come up with a system where I get top N matches by relevance, store them somewhere, and then just re-sort them using users' criteria, without going back to the Lucene index.
For instance, you could store this data inside some JavaScript arrays in the first results page, and re-sort inside the client (browser). Why go all the way back to the server and disk?
Or, if you really want to go to the server, you could come up with a mechanism where the first set of N hits are stored some place on disk in whichever format is suitable (e.g. serialized object on disk, XML...) and then when the user wants to re-sort the matches, go to the server, grab the cached data, sort appropriately, and display. Smells like ajax, if you want to play with that.
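A rough sketch of the server-side variant, assuming an in-memory map instead of files on disk. CachedHits is a hypothetical class, the cache key (for example a session id plus the query string) is an assumption, and eviction of old entries is deliberately left out:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical cache: keeps the first N hits per key so that a re-sort
// request never has to go back to the index.
final class CachedHits<T> {
    private final Map<String, List<T>> cache = new ConcurrentHashMap<>();

    // Store a read-only snapshot of the hits in their original (relevance) order.
    void put(String key, List<T> hitsByRelevance) {
        cache.put(key, Collections.unmodifiableList(new ArrayList<>(hitsByRelevance)));
    }

    // Return a freshly sorted copy; the cached relevance order is left untouched.
    List<T> sorted(String key, Comparator<? super T> order) {
        List<T> cached = cache.get(key);
        if (cached == null) {
            throw new IllegalStateException("no cached hits for " + key);
        }
        List<T> copy = new ArrayList<>(cached);
        copy.sort(order);
        return copy;
    }
}
```

A real deployment would also need some expiry policy so that cached result sets do not accumulate indefinitely.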
Otis
----- Original Message ----
From: Mufaddal Khumri <mk...@allegromedical.com>
To: java-user@lucene.apache.org
Sent: Tue 21 Feb 2006 11:33:22 AM EST
Subject: get results by relevance, limiting results and then sort the results by some criterion
When I do a search for example on "batteries" i get 1200+ results. I
would like to show the user lets say 300. I can do that by only
extracting the first 300 hits (sorted by decreasing relevance by
default) and displaying those to the user.
Now on the search results page, I have a drop down box that lets the
user sort the results by price. When the user selects the "Sort by price
low to high", i would like to be able to sort the same 300 hits I got
above (sorted by decreasing relevance by default) by price.
Essentially I want to be able to sort the first 300 relevant search
results by price. (in other words I would like to be able to get search
results by relevance, limit the results and sort the results by some
criterion).
What would be a good way to do this in lucene?
-Thanks.
get results by relevance, limiting results and then sort the results
by some criterion
Posted by Mufaddal Khumri <mk...@allegromedical.com>.
When I do a search, for example on "batteries", I get 1200+ results. I
would like to show the user, let's say, 300. I can do that by
extracting only the first 300 hits (sorted by decreasing relevance by
default) and displaying those to the user.
Now, on the search results page, I have a drop-down box that lets the
user sort the results by price. When the user selects "Sort by price
low to high", I would like to be able to sort the same 300 hits I got
above (sorted by decreasing relevance by default) by price.
Essentially, I want to be able to sort the first 300 relevant search
results by price (in other words, I would like to get search
results by relevance, limit the results, and sort them by some
criterion).
What would be a good way to do this in Lucene?
-Thanks.
Re: exact match ..
Posted by Robert Watkins <rw...@foo-bar.org>.
The way I have solved the problem of allowing exact matches is, for each
field in which it is possible for an exact match to be requested, a
parallel field is created at index time that is unstemmed and has a
specific prefix:
    if (fieldData.isSearched() && tokenize && usingStemmingAnalyzer) {
        doc.add(new Field(UNSTEMMED_FIELD_PREFIX + fieldName,
                          fieldValueStr, false, true, true));
    }
Also, I use a custom Analyzer for both indexing and searching that
understands this:
    public TokenStream tokenStream(String fieldName, Reader reader)
    {
        TokenStream result = new WISTokenizer(reader);
        result = new StandardFilter(result);
        result = new LowerCaseFilter(result);
        if (stoptable != null) {
            result = new StopFilter(result, stoptable);
        }
        if (!fieldName.startsWith(UNSTEMMED_FIELD_PREFIX)) {
            result = new SpellFilter(result);
            result = new PorterStemFilter(result);
        }
        return result;
    }
For searching, I've written a custom parser using JavaCC (I need to
support more operators than Lucene does OOTB), as well as a
QueryBuilder class that constructs the queries "manually" for each node
type. For a quoted string (i.e. requiring an exact match):
    case JJTQUOTED:
        if (node.hasWildcard()) {
            Node phraseNode = SimpleNode.getPhraseNode(node.getName());
            query = getSpanQuery(new Node[]{phraseNode}, currentField, 0);
        }
        else {
            // match quoted strings "exactly", i.e. without stemming
            // NB: matches are case insensitive
            String fieldToSearch = usingStemmingAnalyzer ?
                UNSTEMMED_FIELD_PREFIX + currentField : currentField;
            query = getTerminalQuery(node.getName(), fieldToSearch);
        }
        break;
and:
    protected Query getTerminalQuery(String term, String currentField)
        throws QueryBuildingException
    {
        Query q;
        try {
            q = org.apache.lucene.queryParser.QueryParser.parse(term,
                    currentField, analyzer);
        }
        catch (org.apache.lucene.queryParser.ParseException e) {
            throw new QueryBuildingException(e);
        }
        return q;
    }
There is, obviously, a fair amount of work involved, but the level of
control is the payoff.
-- Robert
--------------------
Robert Watkins
rwatkins@foo-bar.org
--------------------
On Mon, 20 Feb 2006, Erik Hatcher wrote:
>
> Yes, this is what PerFieldAnalyzerWrapper provides for you, as described in
> detail in several sections of Lucene in Action:
>
> http://www.lucenebook.com/search?query=PerFieldAnalyzerWrapper
>
> Erik
>
Re: exact match ..
Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Feb 20, 2006, at 1:22 PM, Mufaddal Khumri wrote:
> Just realized that the various fields I have are part of the same
> document. But in order to leverage the KeywordAnalyzer, I would
> have to now have two sets of document.
> One document with the fields: title, content <--- analyzed by
> custom analyzer
> Other document with the fields: categoryNames < ---- analyzed by
> keyword analyzer
>
> Is there a way I could have a single document object have some
> fields analyzed by my custom analyzer and the one field -
> "categoryNames" analyzed by the keyword analyzer?
Yes, this is what PerFieldAnalyzerWrapper provides for you, as
described in detail in several sections of Lucene in Action:
http://www.lucenebook.com/search?query=PerFieldAnalyzerWrapper
Erik
Re: exact match ..
Posted by Mufaddal Khumri <mk...@allegromedical.com>.
Hi,
I just realized that the various fields I have are part of the same
document. But in order to leverage the KeywordAnalyzer, I would now
have to have two sets of documents.
One document with the fields: title, content <--- analyzed by custom
analyzer
Other document with the fields: categoryNames < ---- analyzed by keyword
analyzer
Is there a way I could have a single document object have some fields
analyzed by my custom analyzer and the one field - "categoryNames"
analyzed by the keyword analyzer?
Thanks,
Mufaddal Khumri wrote:
> Hi Steve,
>
> If I understand you right, I could use something like the Keyword
> analyzer to tokenize the entire stream as a single token and store
> that in the index. I could definitely the keyword analyzer while
> indexing this particular field "categoryNames".
>
> Now my questions is on how to search and boost this since this is part
> of a bigger boolean query in my case.
>
> My typical query actually looks like:
>
> +(+content:digit +content:camera) +entity:product +(title:"digit
> camera"~2^40.0 ((title:digit title:camera)^10.0) content:"digit
> camera"~2^20.0 (content:digit content:camera) categoryNames:"digit
> camera"^80.0)
>
> As you can see i was trying to do a phrase query on the categoryNames
> field and boosting it by 80.0.
> Also I am using the potter stemming filter to stem while searching. (I
> do this while indexing as well). If I go with the KeywordAnalyzer
> approach I can index the categoryNames field using this analyzer .
>
> Would I be using the QueryParser to create my query and specify the
> keyword analyzer to it while searching on categoryNames ? (and then
> make that query part of my global boolean query?)
>
> -Thanks.
>
>
>
>
>
> Steven Rowe wrote:
>
>> Mufaddal Khumri wrote:
>>
>>> lets say i do this while indexing:
>>>
>>> doc.add(Field.Text("categoryNames", categoryNames));
>>>
>>> Now while searching categoryNames, I do a search for "digital
>>> cameras". I only want to match the exact phrase digital cameras with
>>> documents who have exactly the phrase "digital cameras" in the
>>> categoryNames field. I do not want results that have "digital camera
>>> batteries" part of the result.
>>>
>>> Whats the best way to accomplish this?
>>
>>
>>
>> Hi Mufaddal,
>>
>> One way to do this is to use the KeywordAnalyzer (in the Lucene
>> Subversion trunk, but not in v1.4.3; will be in forthcoming v1.9) for
>> the "categoryNames" field. This analyzer does not tokenize field
>> contents, so "digital cameras" would be a single token, and the only
>> thing that would match it would be the exact same single token. Be
>> careful when you search to construct the search tokens similarly.
>>
>> If you have other fields you want to search, and you want to tokenize
>> their contents when you index them, you could use the
>> PerFieldAnalyzerWrapper, so that the KeywordAnalyzer is only used for
>> the "categoryNames" field.
>>
>> Steve
Re: exact match ..
Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Feb 20, 2006, at 1:02 PM, Mufaddal Khumri wrote:
> If I understand you right, I could use something like the Keyword
> analyzer to tokenize the entire stream as a single token and store
> that in the index. I could definitely use the keyword analyzer while
> indexing this particular field "categoryNames".
The KeywordAnalyzer is not needed for indexing... simply use
Field.Keyword() for indexing without analysis. Beware of case
sensitivity though.
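To illustrate the case-sensitivity caveat, here is a plain-Java sketch that does not use Lucene at all. KeywordField is a hypothetical stand-in for a field whose value is indexed as one untokenized term, and lower-casing both sides is just one assumed way to normalize:

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for an untokenized field: each value is kept as a
// single term, and a query matches only if the whole term is identical.
final class KeywordField {
    private final Set<String> terms = new HashSet<>();
    private final boolean normalizeCase;

    KeywordField(boolean normalizeCase) {
        this.normalizeCase = normalizeCase;
    }

    // The same normalization must be applied at index time and at query time.
    private String normalize(String s) {
        return normalizeCase ? s.toLowerCase(Locale.ROOT) : s;
    }

    void index(String value) {
        terms.add(normalize(value));
    }

    boolean matches(String query) {
        return terms.contains(normalize(query));
    }
}
```

Without normalization, a value indexed as "Digital Cameras" is not found by the query "digital cameras"; with it, the query matches, while "digital camera batteries" still does not.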
> +(+content:digit +content:camera) +entity:product +(title:"digit
> camera"~2^40.0 ((title:digit title:camera)^10.0) content:"digit
> camera"~2^20.0 (content:digit content:camera) categoryNames:"digit
> camera"^80.0)
>
> As you can see, I was trying to do a phrase query on the
> categoryNames field and boost it by 80.0.
> Also, I am using the Porter stemming filter to stem while searching
> (I do this while indexing as well). If I go with the
> KeywordAnalyzer approach, I can index the categoryNames field using
> this analyzer.
>
> Would I be using the QueryParser to create my query and specify the
> keyword analyzer to it while searching on categoryNames ? (and then
> make that query part of my global boolean query?)
You can use the PerFieldAnalyzerWrapper with the KeywordAnalyzer
assigned to your categoryNames field, sure, but you wouldn't have
stemming capability at that point.
Erik
Re: exact match ..
Posted by Mufaddal Khumri <mk...@allegromedical.com>.
Hi Steve,
If I understand you right, I could use something like the Keyword
analyzer to tokenize the entire stream as a single token and store that
in the index. I could definitely use the keyword analyzer while indexing
this particular field "categoryNames".
Now my question is how to search and boost this, since it is part
of a bigger boolean query in my case.
My typical query actually looks like:
+(+content:digit +content:camera) +entity:product +(title:"digit
camera"~2^40.0 ((title:digit title:camera)^10.0) content:"digit
camera"~2^20.0 (content:digit content:camera) categoryNames:"digit
camera"^80.0)
As you can see, I was trying to do a phrase query on the categoryNames
field and boost it by 80.0.
Also, I am using the Porter stemming filter to stem while searching (I
do this while indexing as well). If I go with the KeywordAnalyzer
approach, I can index the categoryNames field using this analyzer.
Would I use the QueryParser to create my query and specify the
keyword analyzer to it while searching on categoryNames (and then make
that query part of my global boolean query)?
-Thanks.
Steven Rowe wrote:
> Mufaddal Khumri wrote:
>
>> lets say i do this while indexing:
>>
>> doc.add(Field.Text("categoryNames", categoryNames));
>>
>> Now while searching categoryNames, I do a search for "digital
>> cameras". I only want to match the exact phrase digital cameras with
>> documents who have exactly the phrase "digital cameras" in the
>> categoryNames field. I do not want results that have "digital camera
>> batteries" part of the result.
>>
>> Whats the best way to accomplish this?
>
>
> Hi Mufaddal,
>
> One way to do this is to use the KeywordAnalyzer (in the Lucene
> Subversion trunk, but not in v1.4.3; will be in forthcoming v1.9) for
> the "categoryNames" field. This analyzer does not tokenize field
> contents, so "digital cameras" would be a single token, and the only
> thing that would match it would be the exact same single token. Be
> careful when you search to construct the search tokens similarly.
>
> If you have other fields you want to search, and you want to tokenize
> their contents when you index them, you could use the
> PerFieldAnalyzerWrapper, so that the KeywordAnalyzer is only used for
> the "categoryNames" field.
>
> Steve
Re: exact match ..
Posted by Steven Rowe <sa...@syr.edu>.
Mufaddal Khumri wrote:
> lets say i do this while indexing:
>
> doc.add(Field.Text("categoryNames", categoryNames));
>
> Now while searching categoryNames, I do a search for "digital cameras".
> I only want to match the exact phrase digital cameras with documents who
> have exactly the phrase "digital cameras" in the categoryNames field. I
> do not want results that have "digital camera batteries" part of the
> result.
>
> Whats the best way to accomplish this?
Hi Mufaddal,
One way to do this is to use the KeywordAnalyzer (in the Lucene
Subversion trunk, but not in v1.4.3; will be in forthcoming v1.9) for
the "categoryNames" field. This analyzer does not tokenize field
contents, so "digital cameras" would be a single token, and the only
thing that would match it would be the exact same single token. Be
careful when you search to construct the search tokens similarly.
If you have other fields you want to search, and you want to tokenize
their contents when you index them, you could use the
PerFieldAnalyzerWrapper, so that the KeywordAnalyzer is only used for
the "categoryNames" field.
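The per-field idea can be sketched in plain Java without any Lucene dependency. PerFieldAnalysis below is a hypothetical illustration of the dispatch only (it is not the PerFieldAnalyzerWrapper API), and its tokenized() method is a deliberately simplified stand-in for a real tokenizing analyzer:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Hypothetical illustration: choose the analysis function by field name and
// fall back to a default for every field without a specific entry.
final class PerFieldAnalysis {
    private final Function<String, List<String>> defaultAnalyzer;
    private final Map<String, Function<String, List<String>>> perField = new HashMap<>();

    PerFieldAnalysis(Function<String, List<String>> defaultAnalyzer) {
        this.defaultAnalyzer = defaultAnalyzer;
    }

    void addAnalyzer(String field, Function<String, List<String>> analyzer) {
        perField.put(field, analyzer);
    }

    List<String> analyze(String field, String text) {
        return perField.getOrDefault(field, defaultAnalyzer).apply(text);
    }

    // Simplified tokenizing analysis: lower-case, then split on whitespace.
    static List<String> tokenized(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(trimmed.toLowerCase(Locale.ROOT).split("\\s+"));
    }

    // Keyword-style analysis: the whole value becomes one unchanged token.
    static List<String> keyword(String text) {
        return Collections.singletonList(text);
    }
}
```

With tokenized() as the default and keyword() registered for "categoryNames", the text "digital cameras" yields two tokens for any other field but exactly one token for categoryNames, which is why "digital camera batteries" can no longer match it.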
Steve
Re: span first query and boosting ..
Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Feb 20, 2006, at 12:22 PM, Mufaddal Khumri wrote:
> Hi,
>
> I do this:
>
> SpanFirstQuery fullPhraseInCategoryNamesQuery = new SpanFirstQuery
> (new SpanTermQuery(new Term("categoryNames", "digital cameras")), 2);
> fullPhraseInCategoryNamesQuery.setBoost(8);
>
> In my log output i get this:
>
> spanFirst(categoryNames:digit camera, 2))
>
> Why cant I boost a span query? What am i doing wrong?
You can boost any Query. However, the .toString is not showing the
boost. Look at IndexSearcher.explain() results to see the effect of
your boosts in action.
Erik
span first query and boosting ..
Posted by Mufaddal Khumri <mk...@allegromedical.com>.
Hi,
I do this:
SpanFirstQuery fullPhraseInCategoryNamesQuery = new SpanFirstQuery(new
SpanTermQuery(new Term("categoryNames", "digital cameras")), 2);
fullPhraseInCategoryNamesQuery.setBoost(8);
In my log output I get this:
spanFirst(categoryNames:digit camera, 2))
Why can't I boost a span query? What am I doing wrong?
-Thanks
exact match ..
Posted by Mufaddal Khumri <mk...@allegromedical.com>.
Let's say I do this while indexing:
doc.add(Field.Text("categoryNames", categoryNames));
Now, while searching categoryNames, I do a search for "digital cameras".
I only want to match documents that have exactly the phrase "digital
cameras" in the categoryNames field. I do not want documents with
"digital camera batteries" to be part of the result.
What's the best way to accomplish this?
Thanks.
Re: StandardAnalyzer question ...
Posted by Oskar Berger <os...@agent25.se>.
Hello,
I'm not yet an expert in the field, but as I understand it, the
terms are indexed as you specify them (through the filters), while the
contents are stored only if you ask for that
(see Field.UnStored(), which happens to be on its way to being deprecated).
So maybe you search the lower cased but indeed get the cased as the
result in this very CASE.
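The distinction can be shown with a toy index in plain Java. This is only an illustration of indexed terms versus stored values under simplified assumptions (lower-casing plus whitespace splitting); TinyIndex is a hypothetical class and does not reflect how Lucene is implemented:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy index: the searchable terms are analyzed (lower-cased and split),
// while the stored value is kept exactly as it was supplied.
final class TinyIndex {
    private final Map<String, Set<Integer>> postings = new HashMap<>(); // term -> doc ids
    private final List<String> stored = new ArrayList<>();              // doc id -> original value

    int add(String value) {
        int docId = stored.size();
        stored.add(value); // stored verbatim, case preserved
        for (String term : value.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
        return docId;
    }

    Set<Integer> docsWithTerm(String term) {
        return postings.getOrDefault(term, Collections.emptySet());
    }

    String storedValue(int docId) {
        return stored.get(docId);
    }
}
```

After adding "Digital Cameras", the term "digital" finds the document and "Digital" does not, yet the stored value that comes back for display is still "Digital Cameras".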
/oskar
On Mon, 2006-02-20 at 09:05 -0700, Mufaddal Khumri wrote:
> Hi,
>
> When StandardAnalyzer is used to index documents, arent the terms,
> amongst other things, lower cased and stored that ways in the index?
>
> I have a index field that I index like this:
>
> ....
> ramWriter = new IndexWriter(ramDir, standardAnalyzer, true);
> ....
> ...
> ...
> doc.add(Field.Text("categoryNames", categoryNames));
> ...
> ...
>
> (I periodically write contents from the ram directory to the file system
> directory.)
>
> When I search this field via luke using the standard analyzer I find
> words like this:
> ....
> Digital Cameras
> Digital Camera Batteries
> ....
>
> Shouldn't the words indexed look like:
>
> ....
> digital cameras
> digital camera batteries
> ....
>
> If I understand this right, when using standard analyzer, shouldn't the
> terms be indexed in lower case?
>
> Thanks,