Posted to java-user@lucene.apache.org by Mansour Al Akeel <ma...@gmail.com> on 2012/06/23 00:26:18 UTC

StandardTokenizer and split tokens

Hello all,

I am trying to write a simple autosuggest functionality. I was looking
at some auto-suggest code, and came across this post:
http://stackoverflow.com/questions/120180/how-to-do-query-auto-completion-suggestions-in-lucene
I have been stuck on some strange words, trying to see how they
are generated. Here's the Analyzer:

public class AutoCompleteAnalyzer extends Analyzer {
	public TokenStream tokenStream(String fieldName, Reader reader) {
		TokenStream result = new StandardTokenizer(Version.LUCENE_36, reader);
		result = new EdgeNGramTokenFilter(result, EdgeNGramTokenFilter.Side.FRONT, 1, 20);
		return result;
	}
}

And this is the relevant method that does the indexing. It's being
called with reindexOn("title");

private void reindexOn(String keyword) throws CorruptIndexException,
		IOException {
	log.info("indexing on " + keyword);
	Analyzer analyzer = new AutoCompleteAnalyzer();
	IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_36, analyzer);
	IndexWriter analyticalWriter = new IndexWriter(suggestIndexDirectory, config);
	analyticalWriter.commit(); // needed to create the initial index
	IndexReader indexReader = IndexReader.open(productsIndexDirectory);
	Map<String, Integer> wordsMap = new HashMap<String, Integer>();
	LuceneDictionary dict = new LuceneDictionary(indexReader, keyword);
	BytesRefIterator iter = dict.getWordsIterator();
	BytesRef ref = null;
	while ((ref = iter.next()) != null) {
		String word = new String(ref.bytes);
		int len = word.length();
		if (len < 3) {
			continue;
		}
		if (wordsMap.containsKey(word)) {
			String msg = "Word " + word + " Already Exists";
			throw new IllegalStateException(msg);
		}
		wordsMap.put(word, indexReader.docFreq(new Term(keyword, word)));
	}

	for (String word : wordsMap.keySet()) {
		Document doc = new Document();
		Field field = new Field(SOURCE_WORD_FIELD, word, Field.Store.YES,
				Field.Index.NOT_ANALYZED);
		doc.add(field);
		field = new Field(GRAMMED_WORDS_FIELD, word, Field.Store.YES,
				Field.Index.ANALYZED);
		doc.add(field);
		String count = Integer.toString(wordsMap.get(word));
		field = new Field(COUNT_FIELD, count, Field.Store.NO,
				Field.Index.NOT_ANALYZED); // count
		doc.add(field);
		analyticalWriter.addDocument(doc);
	}
	analyticalWriter.commit();
	analyticalWriter.close();
	indexReader.close();
}

	private static final String GRAMMED_WORDS_FIELD = "words";
	private static final String SOURCE_WORD_FIELD = "sourceWord";
	private static final String COUNT_FIELD = "count";

And now, my unit test:

	@BeforeClass
	public static void setUp() throws CorruptIndexException, IOException {
		String idxFileName = "myIndexDirectory";
		Indexer indexer = new Indexer(idxFileName);
		indexer.addDoc("Apache Lucene in Action");
		indexer.addDoc("Lord of the Rings");
		indexer.addDoc("Apache Solr in Action");
		indexer.addDoc("apples and Oranges");
		indexer.addDoc("apple iphone");
		indexer.reindexKeywords();
		search = new SearchEngine(idxFileName);
	}

The strange part is that, looking inside the index, I found there are
sourceWords like (lordne, applee, solres). I understand that the n-gram
filter will produce prefixes of each word, e.g.:

l
lo
lor
lord

All of these go into one field, but where do "lordne" and "solres" come
from? I checked the docs for this, and looked into Jira, but didn't find
relevant info.
Is there something I am missing?

I understand there could be easier ways to create this functionality
(http://wiki.apache.org/lucene-java/SpellChecker), but I would like to
resolve this issue and understand whether I am doing something wrong.

Thank you in advance.


RE: StandardTokenizer and split tokens

Posted by Uwe Schindler <uw...@thetaphi.de>.
Hi,

No problem! I also updated the JavaDocs in trunk, 4.x and 3.6.1 to prevent
this wrong usage (missing offset, count, charset).

I am glad that I was able to assist you!

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de



Re: StandardTokenizer and split tokens

Posted by Mansour Al Akeel <ma...@gmail.com>.
Uwe,
thank you for the advice. I updated my code.




RE: StandardTokenizer and split tokens

Posted by Uwe Schindler <uw...@thetaphi.de>.
> I found the main issue.
> I was using BytesRef without the length. This fixed the problem.
> 
> 			String word = new String(ref.bytes, ref.offset, ref.length);

Please see my other mail; using no character set here is the second problem
with your code. This is the correct way to do it:

String word = ref.utf8ToString();

Uwe




Re: StandardTokenizer and split tokens

Posted by Mansour Al Akeel <ma...@gmail.com>.
I found the main issue.
I was using BytesRef without the length. This fixed the problem.

			String word = new String(ref.bytes,ref.offset,ref.length);


Thank you.



RE: StandardTokenizer and split tokens

Posted by Uwe Schindler <uw...@thetaphi.de>.
Don't ever do this:

String word = new String(ref.bytes);

This has the following problems:
- it ignores the character set!!! (in general: never ever use new String(byte[])
without specifying the 2nd charset parameter!). byte[] != String. Depending
on the default charset on your computer this would return bullshit.
- it ignores the length
- it ignores the offset
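
To see where words like "lordne" come from: the iterator typically reuses its
backing byte[] between terms, so a shorter term can leave stale bytes from an
earlier, longer term in the array, and new String(ref.bytes) converts all of
them. Here is a minimal sketch of that effect (the class name and the simulated
terms are hypothetical, not taken from this thread's index):

import java.nio.charset.Charset;

import org.apache.lucene.util.BytesRef;

// Hypothetical demo class, not part of the original post.
public class BytesRefGarbleDemo {
	public static void main(String[] args) {
		Charset utf8 = Charset.forName("UTF-8");
		// Simulate a reused backing array: it previously held the 6-byte
		// term "iphone"; the next, shorter term "lord" overwrites only
		// the first 4 bytes.
		byte[] backing = "iphone".getBytes(utf8);
		System.arraycopy("lord".getBytes(utf8), 0, backing, 0, 4);
		BytesRef ref = new BytesRef(backing, 0, 4); // valid region is "lord"

		System.out.println(new String(ref.bytes)); // prints "lordne" (stale bytes leak in)
		System.out.println(ref.utf8ToString());    // prints "lord"
	}
}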

Use the following code to convert a UTF-8 encoded BytesRef to a String:

String word = ref.utf8ToString()

Thanks :-)

P.S.: I posted this here because I want to prevent the code you posted from
being used by anybody else.
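
For anyone reading this in the archive, the dictionary loop from the original
post would then look roughly like this (a sketch against the Lucene 3.6 API;
the variable and field names are taken from that post):

BytesRefIterator iter = dict.getWordsIterator();
BytesRef ref;
while ((ref = iter.next()) != null) {
	// Decodes exactly ref.length bytes starting at ref.offset, as UTF-8.
	String word = ref.utf8ToString();
	if (word.length() < 3) {
		continue;
	}
	wordsMap.put(word, indexReader.docFreq(new Term(keyword, word)));
}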

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de

