You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by Marco <sp...@libero.it> on 2007/10/30 17:46:39 UTC

EdgeNGramTokenizer

Hi all,
I'm following the suggestion of this forum on how create a suggestion 
service like google suggest.
I'm parsing a words/rank file with my words.
For each word, I'm adding a document with content and rank in in index: 
then I create a EdgeNGramTokenizer of the word. This gives me N words 
and I add each one in a new field of the index.
So for each word in the index I have:
- content (for example apple)
- rank
- X initial letters of the word (for example app)
- X+1 initial letters of the word (for example appl)
- X+2 initial letters of the word (for example apple)
....
- Y initial letters of the word (for example app)

I add each additional field in the index as:

doc.add(new Field(name_gram, tn.termText(), Field.Store.NO, 
Field.Index.UN_TOKENIZED));

When an user search a word, for example app, I have to query the index 
the right field.
I calculate the length of the input and the I query the right field and 
the I get the content of the docs.
Is it ok?
For search I use:

String field_ok = field + line.length(); 
QueryParser parser = new QueryParser(field_ok, analyzer);
Query query = parser.parse(line);
System.out.println("Searching for: " + query.toString(field_ok));
Hits hits = searcher.search(query, new Sort("rank", true));


All is ok if the input of the user doesn't contain spaces...
Am I missing something?
Best regards

Marco


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: EdgeNGramTokenizer

Posted by h t <bl...@gmail.com>.
http://www.shifttab.cn:8001/wiki

2007/10/31, Marco <sp...@libero.it>:
>
> It seems that the problem is when I add the token created by
> EdgeNGramTokenizer in in the index.
> If the token contains a space (for example apple com) I have to add to
> the index with Field.Index.TOKENIZED otherwise the search cannot find it.
> If there is no space there is no problem even if I use
> Field.Index.UN_TOKENIZED.
> So I'd like to know if there is speed search difference insertinf a
> field as TOKENIZED or UN_TOKENIZED.
> Best regards
>
>
>
> Marco ha scritto:
> > Hi all,
> > I'm following the suggestion of this forum on how create a suggestion
> > service like google suggest.
> > I'm parsing a words/rank file with my words.
> > For each word, I'm adding a document with content and rank in in
> > index: then I create a EdgeNGramTokenizer of the word. This gives me N
> > words and I add each one in a new field of the index.
> > So for each word in the index I have:
> > - content (for example apple)
> > - rank
> > - X initial letters of the word (for example app)
> > - X+1 initial letters of the word (for example appl)
> > - X+2 initial letters of the word (for example apple)
> > ....
> > - Y initial letters of the word (for example app)
> >
> > I add each additional field in the index as:
> >
> > doc.add(new Field(name_gram, tn.termText(), Field.Store.NO,
> > Field.Index.UN_TOKENIZED));
> >
> > When an user search a word, for example app, I have to query the index
> > the right field.
> > I calculate the length of the input and the I query the right field
> > and the I get the content of the docs.
> > Is it ok?
> > For search I use:
> >
> > String field_ok = field + line.length(); QueryParser parser = new
> > QueryParser(field_ok, analyzer);
> > Query query = parser.parse(line);
> > System.out.println("Searching for: " + query.toString(field_ok));
> > Hits hits = searcher.search(query, new Sort("rank", true));
> >
> >
> > All is ok if the input of the user doesn't contain spaces...
> > Am I missing something?
> > Best regards
> >
> > Marco
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Re: EdgeNGramTokenizer

Posted by Marco <sp...@libero.it>.
It seems that the problem is when I add the token created by 
EdgeNGramTokenizer in in the index.
If the token contains a space (for example apple com) I have to add to 
the index with Field.Index.TOKENIZED otherwise the search cannot find it.
If there is no space there is no problem even if I use 
Field.Index.UN_TOKENIZED.
So I'd like to know if there is speed search difference insertinf a 
field as TOKENIZED or UN_TOKENIZED.
Best regards



Marco ha scritto:
> Hi all,
> I'm following the suggestion of this forum on how create a suggestion 
> service like google suggest.
> I'm parsing a words/rank file with my words.
> For each word, I'm adding a document with content and rank in in 
> index: then I create a EdgeNGramTokenizer of the word. This gives me N 
> words and I add each one in a new field of the index.
> So for each word in the index I have:
> - content (for example apple)
> - rank
> - X initial letters of the word (for example app)
> - X+1 initial letters of the word (for example appl)
> - X+2 initial letters of the word (for example apple)
> ....
> - Y initial letters of the word (for example app)
>
> I add each additional field in the index as:
>
> doc.add(new Field(name_gram, tn.termText(), Field.Store.NO, 
> Field.Index.UN_TOKENIZED));
>
> When an user search a word, for example app, I have to query the index 
> the right field.
> I calculate the length of the input and the I query the right field 
> and the I get the content of the docs.
> Is it ok?
> For search I use:
>
> String field_ok = field + line.length(); QueryParser parser = new 
> QueryParser(field_ok, analyzer);
> Query query = parser.parse(line);
> System.out.println("Searching for: " + query.toString(field_ok));
> Hits hits = searcher.search(query, new Sort("rank", true));
>
>
> All is ok if the input of the user doesn't contain spaces...
> Am I missing something?
> Best regards
>
> Marco
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org