Posted to java-user@lucene.apache.org by Martin Rode <ma...@programmfabrik.de> on 2005/08/29 10:50:50 UTC

Did you mean?

Hi everybody,

Has anyone tried to code a solution like Google's "Did you mean?" in 
Lucene?

I would be very happy to hear your ideas, approaches, suggestions.

Best,
Martin




---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Did you mean?

Posted by Dave Kor <s0...@sms.ed.ac.uk>.
Quoting Martin Rode <ma...@programmfabrik.de>:

> Hi everybody,
>
> Has anyone tried to code a solution like Google's "Did you mean?" in
> Lucene?
>
> I would be very happy to hear your ideas, approaches, suggestions.

I know that what Google does is look at consecutive queries by the same user
that are similar. If the two queries are very similar, with only one or two
characters changed, there is a very high probability that one of the queries is
the correct spelling while the other is a "common" misspelling. It's easy to
figure out which is the correct spelling by looking up the words in a
dictionary. All they have to do now is store the correct and misspelt word pair
in a mapping table and reference that table for every query.
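
A minimal sketch of the query-pair idea described above, in plain Java with no Lucene dependency. The class name QueryPairMiner, the in-memory dictionary set, and the one-or-two-edit threshold are assumptions made for illustration; nothing here is claimed to be what Google actually runs.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: learn misspelling -> correction pairs from one user's consecutive queries. */
final class QueryPairMiner {

    /** Levenshtein edit distance (insert, delete, substitute each cost 1). */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /**
     * Scans queries in the order they were issued. When two consecutive queries
     * differ by one or two edits and exactly one of them is in the dictionary,
     * records misspelling -> correct spelling in the mapping table.
     */
    static Map<String, String> mine(List<String> sessionQueries, Set<String> dictionary) {
        Map<String, String> corrections = new HashMap<>();
        for (int i = 1; i < sessionQueries.size(); i++) {
            String first = sessionQueries.get(i - 1);
            String second = sessionQueries.get(i);
            int distance = editDistance(first, second);
            if (distance < 1 || distance > 2) {
                continue; // identical or too different to be a spelling fix
            }
            boolean firstKnown = dictionary.contains(first);
            boolean secondKnown = dictionary.contains(second);
            if (firstKnown && !secondKnown) {
                corrections.put(second, first);
            } else if (secondKnown && !firstKnown) {
                corrections.put(first, second);
            }
        }
        return corrections;
    }

    public static void main(String[] args) {
        Set<String> dictionary = Set.of("lucene", "search");
        List<String> session = List.of("lucine", "lucene", "serach", "search");
        System.out.println(mine(session, dictionary));
    }
}
```

At query time the resulting map would be consulted once per query term; with a small query log the table stays sparse, which is the volume caveat raised in the next paragraph.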

Of course, this only works because Google's huge query volume ensures that they
can get sufficient quantities of such query pairs.




Re: Did you mean?

Posted by Jason Haruska <jh...@gmail.com>.
To add to other comments:

This functionality should also look at how common a term is in the corpus. 
Using the corpus as "correct" set of terms to search on isn't always what 
you want if the corpus is unclean (misspellings, etc.)

I believe this is why if you search on an uncommon term, Google will try to 
suggest something more common, even if you spelled the term correctly.

On 8/29/05, Chris Lu <ch...@gmail.com> wrote:
> 
> Constructing a separated index as a dictionary is one part of solution.
> 
> The other part is to construct a dictionary with a list of possible
> "good words".
> By "good words", I mean all legal queries, not necessarily "correct words".
> Two approaches I can think of:
> * Use a word list(it may not be the word list you want, but it is just
> a compromise).
> * Analyze your original index, listing out all words inside.
> 
> There should be other approaches. Anyone?
> 
> --
> Chris Lu
> ------------
> Lucene Search RAD on Any Database
> http://www.dbsight.net
> 
> On 8/29/05, Joseph B. Ottinger <jo...@enigmastation.com> wrote:
> > java.net had an article on this not long ago. See
> > http://today.java.net/pub/a/today/2005/08/09/didyoumean.html .
> >
> > On Mon, 29 Aug 2005, Martin Rode wrote:
> >
> > > Hi everybody,
> > >
> > > Has anyone tried to code a solution like Google's "Did you mean?" in Lucene?
> > >
> > > I would be very happy to hear your ideas, approaches, suggestions.
> > >
> > > Best,
> > > Martin
> >
> > -----------------------------------------------------------------------
> > Joseph B. Ottinger http://enigmastation.com
> > Editor, http://www.TheServerSide.com joeo@enigmastation.com

Re: Did you mean?

Posted by Chris Lu <ch...@gmail.com>.
On 9/2/05, Paul Libbrecht <pa...@activemath.org> wrote:
> Isn't this relatively easily done using current indexReader methods?
> My 2p would be (I intended to do it):
> - have each of your words get analyzed in each flavour (eg stemmed,
> phonetic...)
> - get the tokens in each flavour and compare against those
> - map back (that's the part I haven't done yet).

Mapping the suggested words back after stemming may be a problem,
unless we store the mapping somewhere.
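
One way to "store the mapping" is to keep a table from each analyzed form to the original words that produced it. The sketch below is only an illustration: crudeStem is a deliberately naive stand-in for a real stemmer, and the class and method names are made up for this example.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch: remember which original words each stem came from. */
final class StemMapping {

    /** Naive stand-in for a real stemmer: strips a few English suffixes. */
    static String crudeStem(String word) {
        String w = word.toLowerCase();
        for (String suffix : new String[] {"ing", "ed", "es", "s"}) {
            if (w.endsWith(suffix) && w.length() > suffix.length() + 2) {
                return w.substring(0, w.length() - suffix.length());
            }
        }
        return w;
    }

    /** Builds the stored mapping: stem -> all original words with that stem. */
    static Map<String, Set<String>> buildReverseMap(List<String> indexedWords) {
        Map<String, Set<String>> stemToWords = new HashMap<>();
        for (String word : indexedWords) {
            stemToWords.computeIfAbsent(crudeStem(word), k -> new TreeSet<>()).add(word);
        }
        return stemToWords;
    }

    /** Maps a query word back to the original words sharing its stem. */
    static Set<String> mapBack(Map<String, Set<String>> stemToWords, String queryWord) {
        return stemToWords.getOrDefault(crudeStem(queryWord), Set.of());
    }

    public static void main(String[] args) {
        Map<String, Set<String>> map =
                buildReverseMap(List.of("searching", "searched", "searches", "index"));
        System.out.println(mapBack(map, "searchs"));
    }
}
```

The same table shape would work for a phonetic flavour: only the key function changes.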
-- 
Chris Lu
------------
Lucene Search RAD on Any Database
http://www.dbsight.net

> 
> This is away from frequent search but realizes the "Did you mean"
> paradigm and is quite enough in many cases, I believe.
> 
> paul
> 
> 
> On 29 Aug 05, at 19:08, Chris Lu wrote:
> 
> > * Analyze your original index, listing out all words inside.
> 
> 


Re: Did you mean?

Posted by Paul Libbrecht <pa...@activemath.org>.
Isn't this relatively easily done using the current IndexReader methods?
My 2p would be (I intended to do it):
- have each of your words analyzed in each flavour (e.g. stemmed,
phonetic...)
- get the tokens in each flavour and compare against those
- map back (that's the part I haven't done yet).

This moves away from relying on frequent searches but still realizes the
"Did you mean" paradigm, and is good enough in many cases, I believe.

paul


On 29 Aug 05, at 19:08, Chris Lu wrote:

> * Analyze your original index, listing out all words inside.




Can't return Hits!

Posted by do...@gmx.de.
Hi,

I want to return the Hits so that I can list them,
but I get this exception:

Exception in thread "main" java.io.IOException: Das Handle ist ungültig
	at java.io.RandomAccessFile.seek(Native Method)
	at org.apache.lucene.store.FSInputStream.readInternal(FSDirectory.java:415)
	at org.apache.lucene.store.InputStream.refill(InputStream.java:158)
	at org.apache.lucene.store.InputStream.readByte(InputStream.java:43)
	at org.apache.lucene.store.InputStream.readBytes(InputStream.java:57)
	at org.apache.lucene.index.CompoundFileReader$CSInputStream.readInternal(CompoundFileReader.java:220)
	at org.apache.lucene.store.InputStream.refill(InputStream.java:158)
	at org.apache.lucene.store.InputStream.readByte(InputStream.java:43)
	at org.apache.lucene.store.InputStream.readInt(InputStream.java:73)
	at org.apache.lucene.store.InputStream.readLong(InputStream.java:96)
	at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:59)
	at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:237)
	at org.apache.lucene.search.IndexSearcher.doc(IndexSearcher.java:74)
	at org.apache.lucene.search.Hits.doc(Hits.java:101)
	at search.Searcher.main(Searcher.java:76)


What I do is:

public void search() {
   ....
   hits = searcher.search(query);
}

public Hits getHits() {
   return hits;
}

public static void main (String [] args) {
   ...
   Hits h; 
   h = s.getHits();
   System.out.println(h.length());       //returns a number
   System.out.println(h.doc(0)!=null);   //returns an exception
}

If I call h.doc(0) and the length is not 0, it throws an exception.
Why could this be?
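
The thread contains no reply to this question, so the following is only a guess at the cause. The German message translates to "The handle is invalid", and the trace ends in RandomAccessFile.seek, which is what one would expect if the underlying index file had already been closed when h.doc(0) tried to read the stored document. Hits loads documents lazily through its searcher, so closing the IndexSearcher in the elided part of search() would produce this symptom; whether the elided code does that is an assumption. The stdlib-only sketch below (class name made up) reproduces the same failure mode without Lucene:

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;

/** Hypothetical analogue: reading through a file handle after it was closed. */
final class ClosedHandleDemo {

    /** Returns true if seeking on an already closed file throws IOException. */
    static boolean seekAfterCloseFails() {
        RandomAccessFile file;
        try {
            File tmp = File.createTempFile("closed-handle", ".bin");
            tmp.deleteOnExit();
            file = new RandomAccessFile(tmp, "rw");
            file.write(42);
            file.close(); // analogous to closing the searcher too early
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        try {
            file.seek(0); // analogous to the lazy read behind h.doc(0)
            return false;
        } catch (IOException expected) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("seek after close fails: " + seekAfterCloseFails());
    }
}
```

If this is indeed the cause, keeping the searcher open until all documents have been read from the Hits object should avoid the exception.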

Bye Derya



Re: Did you mean?

Posted by markharw00d <ma...@yahoo.co.uk>.
The "did you mean" implementation should ideally use all of the other
words in a query as context to guide the selection of spelling
alternatives. Google appears to do this, though I'm not sure whether they
use the doc content or user queries to suggest the alternatives.
I've got some colocation finding code which can be run on an existing
index to discover commonly colocated terms from doc contents. This
could be of use here.
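
To make the context idea concrete, here is a rough sketch that is not the colocation code mentioned above: it counts how often two terms appear in the same document and prefers the spelling candidate that co-occurs most with the other query words. The class name, the in-memory document lists, and the scoring rule are all assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Hypothetical sketch: rank spelling candidates by co-occurrence with the rest of the query. */
final class ContextRanker {

    /** Order-independent key for a pair of terms. */
    static String pairKey(String a, String b) {
        return a.compareTo(b) <= 0 ? a + " " + b : b + " " + a;
    }

    /** For each pair of distinct terms, counts the documents containing both. */
    static Map<String, Integer> countPairs(List<List<String>> docs) {
        Map<String, Integer> pairCounts = new HashMap<>();
        for (List<String> doc : docs) {
            List<String> terms = new ArrayList<>(new TreeSet<>(doc)); // unique terms
            for (int i = 0; i < terms.size(); i++) {
                for (int j = i + 1; j < terms.size(); j++) {
                    pairCounts.merge(pairKey(terms.get(i), terms.get(j)), 1, Integer::sum);
                }
            }
        }
        return pairCounts;
    }

    /** Returns the candidate with the highest summed co-occurrence; first one wins ties. */
    static String bestCandidate(List<String> candidates, List<String> contextTerms,
                                Map<String, Integer> pairCounts) {
        String best = null;
        int bestScore = -1;
        for (String candidate : candidates) {
            int score = 0;
            for (String context : contextTerms) {
                score += pairCounts.getOrDefault(pairKey(candidate, context), 0);
            }
            if (score > bestScore) {
                bestScore = score;
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countPairs(List.of(
                List.of("apache", "lucene", "search"),
                List.of("lucene", "index", "search"),
                List.of("lucerne", "switzerland", "lake")));
        // Query "lucine search": two spelling candidates, one context word.
        System.out.println(bestCandidate(List.of("lucene", "lucerne"), List.of("search"), counts));
    }
}
```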


		


Re: Did you mean?

Posted by Otis Gospodnetic <ot...@yahoo.com>.
I wonder if it would further help for the spell checker to make use of
something like WordNet (for English only), where low-frequency words
are "double-checked" against WordNet before being considered correct.
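
A small sketch of that double-check, assuming the lexicon is available as a plain set of words. It does not call any WordNet API; the lexicon set, the minFreq threshold, and the class name are placeholders for illustration.

```java
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: trust rare index terms only if a reference lexicon confirms them. */
final class LowFrequencyCheck {

    /**
     * A term counts as a correct spelling if it is frequent enough in the index,
     * or if it is rare but also present in the lexicon (the WordNet stand-in).
     */
    static boolean isTrusted(String term, Map<String, Integer> docFreq,
                             Set<String> lexicon, int minFreq) {
        int freq = docFreq.getOrDefault(term, 0);
        if (freq == 0) {
            return false; // not in the index at all, so it cannot produce hits
        }
        if (freq >= minFreq) {
            return true; // common enough to trust without a second opinion
        }
        return lexicon.contains(term); // rare: needs confirmation
    }

    public static void main(String[] args) {
        Map<String, Integer> docFreq = Map.of("lucene", 120, "lucine", 2, "zeugma", 1);
        Set<String> lexicon = Set.of("zeugma");
        for (String term : new String[] {"lucene", "lucine", "zeugma"}) {
            System.out.println(term + " trusted: " + isTrusted(term, docFreq, lexicon, 5));
        }
    }
}
```

Note that a domain term such as "lucene" is trusted here only because it is frequent; a rare proper noun missing from the lexicon would be rejected, which is one limitation of this approach.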

Otis

--- Tom White <to...@gmail.com> wrote:

> On 8/29/05, Chris Lu <ch...@gmail.com> wrote:
> > 
> > 
> > Two approaches I can think of:
> > * Use a word list(it may not be the word list you want, but it is just
> > a compromise).
> > * Analyze your original index, listing out all words inside.
> > 
> > 
> Using a word list suffers from two problems:
> 1. (Coverage) No word list is complete, they need to be maintained as new
> words are coined, and word lists for different languages vary in quality.
> 2. (Useless suggestions) There is little point in making a suggestion for
> words that aren't in the original index (as they wouldn't produce any hits).
> 
> For these reasons it is better to use the original index as a source of
> words. It is true that the index will likely contain spelling errors,
> however Lucene Spell Checker provides a way to restrict suggestions to words
> that are more popular than the query term. As misspellings are typically
> rarer than correct spellings this should ensure that misspelled suggestions
> are almost never made. The article quoted above (which I wrote), provides a
> bit more discussion.
> 
> Tom
> 




Re: Did you mean?

Posted by Tom White <to...@gmail.com>.
On 8/29/05, Chris Lu <ch...@gmail.com> wrote:
> 
> 
> Two approaches I can think of:
> * Use a word list(it may not be the word list you want, but it is just
> a compromise).
> * Analyze your original index, listing out all words inside.
> 
> 
Using a word list suffers from two problems:
1. (Coverage) No word list is complete: lists need to be maintained as new 
words are coined, and word lists for different languages vary in quality.
2. (Useless suggestions) There is little point in making a suggestion for 
words that aren't in the original index (as they wouldn't produce any hits).

For these reasons it is better to use the original index as a source of 
words. It is true that the index will likely contain spelling errors; 
however, Lucene Spell Checker provides a way to restrict suggestions to words 
that are more popular than the query term. As misspellings are typically 
rarer than correct spellings, this should ensure that misspelled suggestions 
are almost never made. The article quoted above (which I wrote) provides a 
bit more discussion.
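
The popularity restriction can be illustrated in plain Java. This is a sketch of the idea only, not the Lucene Spell Checker's API: the class name, the docFreq map, and the precomputed list of similar terms are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: only suggest terms that are more frequent than the query term. */
final class PopularitySuggester {

    /**
     * Keeps the similar terms whose document frequency is strictly higher than
     * that of the query term, ordered from most to least frequent.
     */
    static List<String> morePopular(String queryTerm, List<String> similarTerms,
                                    Map<String, Integer> docFreq) {
        int queryFreq = docFreq.getOrDefault(queryTerm, 0);
        List<String> result = new ArrayList<>();
        for (String candidate : similarTerms) {
            if (docFreq.getOrDefault(candidate, 0) > queryFreq) {
                result.add(candidate);
            }
        }
        // Every kept candidate has a frequency above zero, so it is present in docFreq.
        result.sort((x, y) -> Integer.compare(docFreq.get(y), docFreq.get(x)));
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> docFreq = Map.of("lucene", 120, "lucine", 2, "lucerne", 1);
        System.out.println(morePopular("lucine", List.of("lucene", "lucerne"), docFreq));
        System.out.println(morePopular("lucene", List.of("lucine", "lucerne"), docFreq));
    }
}
```

With this filter a misspelling that made it into the index is never offered as a suggestion for the more common correct spelling, while the correct spelling is offered for the misspelling.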

Tom

Re: Did you mean?

Posted by Chris Lu <ch...@gmail.com>.
Constructing a separate index as a dictionary is one part of the solution.

The other part is to construct a dictionary with a list of possible
"good words".
By "good words", I mean all legal queries, not necessarily "correct words".
Two approaches I can think of:
* Use a word list (it may not be the word list you want, but it is just
a compromise).
* Analyze your original index, listing out all words inside.

There should be other approaches. Anyone?
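
As a rough illustration of the second approach, the sketch below builds a word list with document frequencies from raw document text. It is stdlib-only; with a real Lucene index the same information would presumably come from the index's own term enumeration (for example IndexReader.terms()) rather than from re-tokenizing text. The class name and the tokenization rule are assumptions.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical sketch: list all words in a corpus with their document frequencies. */
final class VocabularyBuilder {

    /** Returns word -> number of documents that contain the word. */
    static Map<String, Integer> documentFrequencies(List<String> documents) {
        Map<String, Integer> freq = new TreeMap<>();
        for (String doc : documents) {
            Set<String> seenInDoc = new HashSet<>();
            // Split on anything that is not a letter or a digit.
            for (String token : doc.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
                if (!token.isEmpty() && seenInDoc.add(token)) {
                    freq.merge(token, 1, Integer::sum);
                }
            }
        }
        return freq;
    }

    public static void main(String[] args) {
        System.out.println(documentFrequencies(List.of(
                "Lucene is a search library",
                "Search with Lucene",
                "A lucine typo")));
    }
}
```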

-- 
Chris Lu
------------
Lucene Search RAD on Any Database
http://www.dbsight.net

On 8/29/05, Joseph B. Ottinger <jo...@enigmastation.com> wrote:
> java.net had an article on this not long ago. See
> http://today.java.net/pub/a/today/2005/08/09/didyoumean.html .
> 
> On Mon, 29 Aug 2005, Martin Rode wrote:
> 
> > Hi everybody,
> >
> > Has anyone tried to code a solution like Google's "Did you mean?" in Lucene?
> >
> > I would be very happy to hear your ideas, approaches, suggestions.
> >
> > Best,
> > Martin
> 
> -----------------------------------------------------------------------
> Joseph B. Ottinger                             http://enigmastation.com
> Editor, http://www.TheServerSide.com             joeo@enigmastation.com


Re: Did you mean?

Posted by "Joseph B. Ottinger" <jo...@enigmastation.com>.
java.net had an article on this not long ago. See 
http://today.java.net/pub/a/today/2005/08/09/didyoumean.html .

On Mon, 29 Aug 2005, Martin Rode wrote:

> Hi everybody,
>
> Has anyone tried to code a solution like Google's "Did you mean?" in Lucene?
>
> I would be very happy to hear your ideas, approaches, suggestions.
>
> Best,
> Martin

-----------------------------------------------------------------------
Joseph B. Ottinger                             http://enigmastation.com
Editor, http://www.TheServerSide.com             joeo@enigmastation.com
