Posted to user@lucenenet.apache.org by yin <yi...@AI.SRI.COM> on 2008/01/05 01:43:24 UTC
Strange Indexing Problem with letter-number combination
Hello there!
I am seeing a very strange indexing problem that I hope someone can shed
some light on.
I use a StandardAnalyzer (the default one, with no special configuration).
It works well until it hits a file that contains a letter-number
combination such as "alison29". I checked the index with Luke, and here is
the strange thing:
For the text "how are you", I get three index entries: "how", "are", and
"you". For the text "hello alison20 there", however, I get only one index
entry, "hello,alison29,there". As a consequence, searches for "alison29",
"hello", or "there" return nothing; a search works only if I look for
exactly "hello,alison29,there".
I could pad both my index and my search keywords, but I am not very
comfortable with that. The issue also seems too obvious to be an overlooked
bug, so more likely I missed something, perhaps some parameter setting in
the Lucene StandardAnalyzer? Any ideas? Thank you very much for your help!
Regards,
Min
RE: Strange Indexing Problem with letter-number combination
Posted by DIGY <di...@gmail.com>.
Hi Min,
Try other analyzers (such as WhitespaceAnalyzer).
DIGY
Re: Strange Indexing Problem with letter-number combination
Posted by Min Yin <yi...@AI.SRI.COM>.
Hello,
Thanks for the reply! I've found that the problem is caused by the
commas that separate the words. If I change the commas to spaces or
semicolons, it works fine. Commas also work as long as none of the words
contains a digit. Maybe it has something to do with numbers such as
"10,000"?
I also have a second, somewhat related question. If I index the text
"deskbar-abc", it is indexed as "deskbar" and "abc", but if I index
"deskbar-abc288" instead, it is treated as one word. Is there a way to
make this work consistently? For example, always keep the dash and never
split the word?
Many thanks in advance!
Min
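Min's guess about "10,000" looks close to the mark. As far as the classic StandardTokenizer grammar goes, it has a rule for numbers and product codes that joins alphanumeric runs across "_", "-", "/", "." and "," as long as every other run contains a digit. The sketch below is only a rough regex model of that one rule, not Lucene.Net code: ClassicNumRuleSketch and Tokenize are hypothetical names, it handles ASCII only, and it ignores the other grammar rules (acronyms, e-mail addresses, host names, apostrophes) as well as the lowercase and stop-word filters.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: a simplified model of one StandardTokenizer rule.
// It is an assumption-laden approximation, not the real tokenizer.
public static class ClassicNumRuleSketch
{
    // A run of letters and digits.
    private const string Alnum = "[A-Za-z0-9]+";
    // A run of letters and digits that contains at least one digit.
    private const string HasDigit = "[A-Za-z0-9]*[0-9][A-Za-z0-9]*";
    // Punctuation that may join runs: _ - / . ,
    private const string P = "[_\\-/.,]";

    private static readonly Regex TokenPattern = new Regex(
        // Joined runs where the 2nd, 4th, ... run contains a digit.
        Alnum + P + HasDigit
            + "(?:" + P + Alnum + P + HasDigit + ")*(?:" + P + Alnum + ")?"
        // Joined runs where the 1st, 3rd, ... run contains a digit.
        + "|" + HasDigit + P + Alnum
            + "(?:" + P + HasDigit + P + Alnum + ")*(?:" + P + HasDigit + ")?"
        // Otherwise a plain word; the punctuation is dropped.
        + "|" + Alnum);

    public static List<string> Tokenize(string text)
    {
        var tokens = new List<string>();
        foreach (Match m in TokenPattern.Matches(text))
        {
            tokens.Add(m.Value);
        }
        return tokens;
    }

    public static void Main()
    {
        string[] samples =
        {
            "hello,alison29,there", "hello,world,there",
            "deskbar-abc", "deskbar-abc288", "10,000"
        };
        foreach (string s in samples)
        {
            Console.WriteLine(s + " -> [" + string.Join("] [", Tokenize(s).ToArray()) + "]");
        }
    }
}
```

Under this model, "hello,alison29,there", "deskbar-abc288" and "10,000" each come out as a single token, while "hello,world,there" and "deskbar-abc" are split at the punctuation. That matches what Min saw in Luke, so the behaviour is probably the tokenizer working as designed rather than a bug. Getting consistent handling of the dash would then likely mean a different tokenizer or a pre-processing step instead of a StandardAnalyzer setting.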
RE: Strange Indexing Problem with letter-number combination
Posted by DIGY <di...@gmail.com>.
1.
I tried your case with the following code, and everything worked as expected.

    Test(new Lucene.Net.Analysis.Standard.StandardAnalyzer(),
         "hello alison20 there", "alison20");

    void Test(Lucene.Net.Analysis.Analyzer analyzer,
              string stringToIndex, string stringToSearch)
    {
        // Index a single document in an in-memory directory.
        Lucene.Net.Store.RAMDirectory dir = new Lucene.Net.Store.RAMDirectory();
        Lucene.Net.Index.IndexWriter writer =
            new Lucene.Net.Index.IndexWriter(dir, analyzer);
        Lucene.Net.Documents.Document doc = new Lucene.Net.Documents.Document();
        Lucene.Net.Documents.Field field = new Lucene.Net.Documents.Field(
            "field1", stringToIndex,
            Lucene.Net.Documents.Field.Store.YES,
            Lucene.Net.Documents.Field.Index.TOKENIZED);
        doc.Add(field);
        writer.AddDocument(doc);
        writer.Close();

        // Search the same field with the same analyzer and report the hits.
        Lucene.Net.Search.IndexSearcher searcher =
            new Lucene.Net.Search.IndexSearcher(dir);
        Lucene.Net.QueryParsers.QueryParser qp =
            new Lucene.Net.QueryParsers.QueryParser("field1", analyzer);
        Lucene.Net.Search.Query q = qp.Parse(stringToSearch);
        Lucene.Net.Search.Hits hits = searcher.Search(q);
        Console.WriteLine(hits.Length().ToString() + " hit(s)");
    }

2.
Using StandardAnalyzer, the tokens of the string "hello alison20 there" are
"hello" and "alison20" (as expected; "there" is dropped as a stop word).

    TokenizeString(new Lucene.Net.Analysis.Standard.StandardAnalyzer(),
                   "hello alison20 there");

    void TokenizeString(Lucene.Net.Analysis.Analyzer analyzer, string s)
    {
        Lucene.Net.Analysis.TokenStream ts =
            analyzer.TokenStream("", new System.IO.StringReader(s));
        // Print each token's text and type until the stream is exhausted.
        for (Lucene.Net.Analysis.Token t = ts.Next(); t != null; t = ts.Next())
        {
            Console.WriteLine(t.TermText() + " " + t.Type());
        }
    }

DIGY