Posted to user@lucenenet.apache.org by yin <yi...@AI.SRI.COM> on 2008/01/05 01:43:24 UTC

Strange Indexing Problem with letter-number combination

Hello there!

I see a very strange indexing problem that I hope someone can shed some
light on.

I use a StandardAnalyzer (the default one, with no special configuration).
It works great until it hits a file that contains a letter-number
combination word such as "alison29". I checked the index with Luke, and
here is the strange thing:

For the text "how are you", I get three index entries: "how", "are", and
"you". For the text "hello alison20 there", however, I get only one index
entry, "hello,alison29,there". As a consequence, searches for "alison29",
"hello", or "there" return nothing; a search works only if I look for
exactly "hello,alison29,there".

I could pad both my indexed text and my search keywords, but I am not very
comfortable with that. The issue seems too obvious to be an overlooked bug,
so more likely I missed something, perhaps some parameter setting in the
Lucene StandardAnalyzer? Any ideas? Thank you very much for your help!

Regards,

Min

RE: Strange Indexing Problem with letter-number combination

Posted by DIGY <di...@gmail.com>.

Hi Min,

Try other analyzers (such as WhitespaceAnalyzer).

DIGY

-----Original Message-----
From: Min Yin [mailto:yin@AI.SRI.COM] 
Sent: Wednesday, January 09, 2008 2:38 AM
To: lucene-net-user@incubator.apache.org
Subject: Re: Strange Indexing Problem with letter-number combination

Re: Strange Indexing Problem with letter-number combination

Posted by Min Yin <yi...@AI.SRI.COM>.

Hello,

Thanks for the reply! I've found that the problem is caused by the commas
that separate the words. If I change the commas to spaces or semicolons,
it works fine. Commas also work as long as none of the words contains a
digit. Maybe it has something to do with numbers such as "10,000"?

I also have a second, somewhat related question. If I index the text
"deskbar-abc", it is indexed as "deskbar" and "abc", but if I index
"deskbar-abc288" instead, it is treated as one word. Is there a way to
make this work consistently? For example, to always keep the dash and not
split the word?

Many thanks in advance!
Min
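
The workaround described above (turning commas into spaces before the text
reaches the analyzer) could be sketched roughly as follows. The names
SeparatorNormalizer and CommasToSpaces are hypothetical and not part of
Lucene.Net. The sketch assumes that commas in the data are only ever
separators, so it would also break a number such as "10,000" into "10" and
"000".

```csharp
using System;

// Hypothetical helper, not part of Lucene.Net.
public static class SeparatorNormalizer
{
    // Returns a copy of the text with every comma replaced by a space,
    // so that "hello,alison29,there" reaches the analyzer as three
    // whitespace-separated words. Dashes and other characters are untouched.
    public static string CommasToSpaces(string text)
    {
        if (text == null)
        {
            throw new ArgumentNullException("text");
        }
        return text.Replace(',', ' ');
    }
}
```

The text would be passed through SeparatorNormalizer.CommasToSpaces(...)
before it is added to the document. If queries can contain commas too, the
same call would apply to the query string.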

RE: Strange Indexing Problem with letter-number combination

Posted by DIGY <di...@gmail.com>.

1.
I tried your case with the following code, and everything worked as expected.

    Test(new Lucene.Net.Analysis.Standard.StandardAnalyzer(), "hello alison20 there", "alison20");

    // Indexes a single string in an in-memory index, then searches it
    // with the same analyzer and prints the number of hits.
    void Test(Lucene.Net.Analysis.Analyzer analyzer, string stringToIndex, string stringToSearch)
    {
        Lucene.Net.Store.RAMDirectory dir = new Lucene.Net.Store.RAMDirectory();
        Lucene.Net.Index.IndexWriter writer = new Lucene.Net.Index.IndexWriter(dir, analyzer);
        Lucene.Net.Documents.Document doc = new Lucene.Net.Documents.Document();
        Lucene.Net.Documents.Field field = new Lucene.Net.Documents.Field("field1", stringToIndex,
            Lucene.Net.Documents.Field.Store.YES,
            Lucene.Net.Documents.Field.Index.TOKENIZED);
        doc.Add(field);
        writer.AddDocument(doc);
        writer.Close();

        Lucene.Net.Search.IndexSearcher searcher = new Lucene.Net.Search.IndexSearcher(dir);
        Lucene.Net.QueryParsers.QueryParser qp = new Lucene.Net.QueryParsers.QueryParser("field1", analyzer);
        Lucene.Net.Search.Query q = qp.Parse(stringToSearch);
        Lucene.Net.Search.Hits hits = searcher.Search(q);
        System.Console.WriteLine(hits.Length() + " hit(s)");
    }


2.
Using StandardAnalyzer, the tokens of the string "hello alison20 there" are
"hello" and "alison20", as expected ("there" is dropped as a stop word).

    TokenizeString(new Lucene.Net.Analysis.Standard.StandardAnalyzer(), "hello alison20 there");

    // Prints each token produced by the analyzer, followed by its token type.
    void TokenizeString(Lucene.Net.Analysis.Analyzer analyzer, string s)
    {
        Lucene.Net.Analysis.TokenStream ts = analyzer.TokenStream("", new System.IO.StringReader(s));
        for (Lucene.Net.Analysis.Token t = ts.Next(); t != null; t = ts.Next())
        {
            System.Console.WriteLine(t.TermText() + " " + t.Type());
        }
    }


DIGY
