Posted to dev@lucene.apache.org by "Karl Wettin (JIRA)" <ji...@apache.org> on 2007/08/17 02:11:31 UTC

[jira] Updated: (LUCENE-626) Extended spell checker with phrase support and adaptive user session analysis.

     [ https://issues.apache.org/jira/browse/LUCENE-626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wettin updated LUCENE-626:
-------------------------------

    Attachment: LUCENE-626_20070817.patch

The phrase-suggestion layer on top of contrib/spell in this patch has been mentioned in a number of forums over the last few weeks, so I've removed the LUCENE-550 dependency and brought the patch up to date with trunk.

Second-level suggesting (ngram token, phrase) can run standalone; see TestTokenPhraseSuggester. However, I recommend the adaptive dictionary, as it acts as a cache on top of the second-level suggestions. (See the docs.)

Output from using the adaptive layer only, i.e. suggestions based on how users previously behaved. About half a million user queries were analyzed to build the dictionary (it takes 30 seconds to build on my dual core):

3ms	 pirates ofthe caribbean -> pirates of the caribbean
2ms	 pirates of the carribbean -> pirates of the caribbean
0ms	 pirates carricean -> pirates caribbean
1ms	 pirates of the carriben -> pirates of the caribbean
0ms	 pirates of the carabien -> pirates of the caribbean
0ms	 pirates of the carabbean -> pirates of the caribbean
1ms	 pirates og carribean -> pirates of the caribbean
0ms	 pirates of the caribbean music -> pirates of the caribbean soundtrack
0ms	 pirates of the caribbean soundtrack -> pirates of the caribbean score
0ms	 pirate of carabian -> pirate of caribbean
0ms	 pirate of caribbean -> pirates of caribbean
0ms	 pirates of caribbean -> pirates of caribbean
0ms	 homm 4 -> homm iv
0ms	 the pilates -> null
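
To illustrate the idea behind the adaptive layer, here is a minimal sketch. It is not the code in the patch: the class and method names are hypothetical, and it assumes (a simplification) that the last query of a user session is the one the user was satisfied with, so earlier queries in the same session are treated as misspellings of it.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names come from the patch.
class AdaptiveDictionarySketch {

    // query -> (final query of the same session -> number of times observed)
    private final Map<String, Map<String, Integer>> corrections = new HashMap<>();

    // Assumption: the last query in a session is the user's goal.
    void recordSession(List<String> queriesInOrder) {
        if (queriesInOrder.size() < 2) {
            return; // a single query tells us nothing about corrections
        }
        String goal = queriesInOrder.get(queriesInOrder.size() - 1);
        for (String query : queriesInOrder.subList(0, queriesInOrder.size() - 1)) {
            if (!query.equals(goal)) {
                corrections.computeIfAbsent(query, k -> new HashMap<>())
                           .merge(goal, 1, Integer::sum);
            }
        }
    }

    // Returns the most frequently observed correction, or null if the query is unknown.
    String didYouMean(String query) {
        Map<String, Integer> counts = corrections.get(query);
        if (counts == null) {
            return null;
        }
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > bestCount) {
                best = entry.getKey();
                bestCount = entry.getValue();
            }
        }
        return best;
    }
}
```

A lookup is a couple of hash map reads, which is consistent with the millisecond-level timings above. A query that no user has ever corrected yields null, as "the pilates" does.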


Output from the phrase ngram token suggester, which uses token matrices checked against an a priori index. A single suggestion requires a lot of queries. Using an instantiated index as the a priori index saves plenty of milliseconds. This is expensive stuff, but it works pretty well.

72ms	 the pilates -> the pirates
440ms	 heroes of fight and magic -> heroes of might and magic
417ms	 heroes of right and magic -> heroes of might and magic
383ms	 heroes of magic and light -> heroes of might and magic
20ms	 heroesof lightand magik -> null
385ms	 heroes of light and magik -> heroes of might and magic
0ms	 heroesof lightand magik -> heroes of might and magic
385ms	 heroes of magic and might -> heroes of might and magic 

(That 0ms is because the previous result was cached. One does not have to use this cache.)
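
For readers wondering why one suggestion needs so many queries, here is a rough sketch of a token matrix check. It is an assumption-laden illustration, not the implementation in the patch: all names are hypothetical, a plain set of phrases stands in for the a priori index, a hand-filled map stands in for the ngram speller, and it handles neither wrong term order nor run-together tokens such as "heroesof". The result cache is keyed on the raw query string only.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; none of these names come from the patch.
class TokenMatrixSketch {

    // Stand-in for the a priori index: phrases known to exist.
    private final Set<String> knownPhrases = new HashSet<>();
    // Stand-in for the ngram speller: alternatives per token.
    private final Map<String, List<String>> alternatives = new HashMap<>();
    // Memoized results, so a repeated query costs a single map lookup.
    private final Map<String, String> cache = new HashMap<>();
    // Number of phrase lookups performed so far, to make the cost visible.
    int lookups = 0;

    void addKnownPhrase(String phrase) {
        knownPhrases.add(phrase);
    }

    void addAlternatives(String token, List<String> candidates) {
        alternatives.put(token, candidates);
    }

    String suggest(String query) {
        if (cache.containsKey(query)) {
            return cache.get(query); // may be a cached null
        }
        // One row per query token: the token itself, then its alternatives.
        List<List<String>> matrix = new ArrayList<>();
        for (String token : query.trim().split("\\s+")) {
            List<String> row = new ArrayList<>();
            row.add(token);
            row.addAll(alternatives.getOrDefault(token, Collections.<String>emptyList()));
            matrix.add(row);
        }
        String result = search(matrix, 0, new ArrayList<String>(), query);
        cache.put(query, result);
        return result;
    }

    // Depth-first walk over every combination of one entry per row.
    private String search(List<List<String>> matrix, int row, List<String> picked, String query) {
        if (row == matrix.size()) {
            String candidate = String.join(" ", picked);
            lookups++; // each combination costs one lookup against the phrase set
            boolean hit = knownPhrases.contains(candidate) && !candidate.equals(query);
            return hit ? candidate : null;
        }
        for (String token : matrix.get(row)) {
            picked.add(token);
            String found = search(matrix, row + 1, picked, query);
            picked.remove(picked.size() - 1);
            if (found != null) {
                return found;
            }
        }
        return null;
    }
}
```

In the worst case the number of lookups is the product of the row sizes, so a five-token query with a handful of alternatives per token quickly reaches hundreds of lookups. That is why the cheap adaptive layer is worth putting in front as a cache.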

> Extended spell checker with phrase support and adaptive user session analysis.
> ------------------------------------------------------------------------------
>
>                 Key: LUCENE-626
>                 URL: https://issues.apache.org/jira/browse/LUCENE-626
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>            Reporter: Karl Wettin
>            Assignee: Karl Wettin
>            Priority: Minor
>         Attachments: didyoumean.patch.bz2, LUCENE-626_20070817.patch, spellchecker.diff
>
>
> Extensive javadocs are available in the patch, and I try to keep a compiled copy here: http://ginandtonique.org/~kalle/javadocs/didyoumean/org/apache/lucene/search/didyoumean/package-summary.html#package_description
> The patch spellchecker.diff should not depend on anything but Lucene trunk. It has basic support for phrase suggestions and query goal detection, but it is pretty buggy and lacks features available in didyoumean.patch.bz2. The latter depends on LUCENE-550.
> Example:
> {code:java}
> public void testImportData() throws Exception {
>     // load 200 000 user queries with session data and time stamp. no goals specified.
>     System.out.println("Processing http://ginandtonique.org/~kalle/data/pirate.data.gz");
>     importFile(new InputStreamReader(new GZIPInputStream(new URL("http://ginandtonique.org/~kalle/data/pirate.data.gz").openStream())));
>     System.out.println("Processing http://ginandtonique.org/~kalle/data/hero.data.gz");
>     importFile(new InputStreamReader(new GZIPInputStream(new URL("http://ginandtonique.org/~kalle/data/hero.data.gz").openStream())));
>     System.out.println("Done.");
>     // run some tests without the second level suggestions,
>     // i.e. user behavioral data only. no ngrams or so.
>     
>     assertEquals("pirates of the caribbean", facade.didYouMean("pirates ofthe caribbean"));
>     assertEquals("pirates of the caribbean", facade.didYouMean("pirates of the carribbean"));
>     assertEquals("pirates caribbean", facade.didYouMean("pirates carricean"));
>     assertEquals("pirates of the caribbean", facade.didYouMean("pirates of the carriben"));
>     assertEquals("pirates of the caribbean", facade.didYouMean("pirates of the carabien"));
>     assertEquals("pirates of the caribbean", facade.didYouMean("pirates of the carabbean"));
>     assertEquals("pirates of the caribbean", facade.didYouMean("pirates og carribean"));
>     assertEquals("pirates of the caribbean soundtrack", facade.didYouMean("pirates of the caribbean music"));
>     assertEquals("pirates of the caribbean score", facade.didYouMean("pirates of the caribbean soundtrack"));
>     assertEquals("pirate of caribbean", facade.didYouMean("pirate of carabian"));
>     assertEquals("pirates of caribbean", facade.didYouMean("pirate of caribbean"));
>     assertEquals("pirates of caribbean", facade.didYouMean("pirates of caribbean"));
>     // depending on how many hits and goals are noted with these two queries,
>     // perhaps the delta should be added to a synonym dictionary?
>     assertEquals("homm iv", facade.didYouMean("homm 4"));
>     // not yet known.. and we have no second level yet.
>     assertNull(facade.didYouMean("the pilates"));
>     // use the dictionary built from user queries to build the token phrase and ngram suggester.      
>     facade.getDictionary().getPrioritesBySecondLevelSuggester().put(Factory.ngramTokenPhraseSuggesterFactory(facade.getDictionary()), 1d);
>     // now it's learned
>     assertEquals("the pirates", facade.didYouMean("the pilates"));
>     // typos
>     assertEquals("heroes of might and magic", facade.didYouMean("heroes of fight and magic"));
>     assertEquals("heroes of might and magic", facade.didYouMean("heroes of right and magic"));
>     assertEquals("heroes of might and magic", facade.didYouMean("heroes of magic and light"));
>     // composite dictionary key not learned yet..
>     assertNull(facade.didYouMean("heroesof lightand magik"));
>     // learn
>     assertEquals("heroes of might and magic", facade.didYouMean("heroes of light and magik"));
>     // test
>     assertEquals("heroes of might and magic", facade.didYouMean("heroesof lightand magik"));
>     // wrong term order
>     assertEquals("heroes of might and magic", facade.didYouMean("heroes of magic and might"));
>   }
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

