Posted to user@lucenenet.apache.org by Tim Haughton <ti...@gmail.com> on 2011/09/26 14:28:30 UTC

[Lucene.Net] Indexing Oddity

Hi, I'm trying to index a text file containing the following text:

DNE,APLU,GB11/0290
DNE,CMDU,11-1431
DNE,EGLV,NO CONTRACT
DNE,HJSC,ANE112376
DNE,HLCU,NO CONTRACT
DNE,MAEU,547712
DNE,MOLU,NO CONTRACT
DNE,OOLU,AE115029

It appears that each "line" is being indexed as one complete string, rather
than at least 3 terms. So if I search for "547712" I get no results. But if
I search for "DNE,MAEU,547712" I find the document. If I add a space after
each comma it indexes them as individual terms.

Is this expected behaviour using the StandardAnalyzer?

Cheers,

Tim

RE: [Lucene.Net] Indexing Oddity

Posted by "Granroth, Neal V." <ne...@thermofisher.com>.

Yes Tim,

The Standard Analyzer is primarily intended for text documents containing English words and phrases and certain other common pieces of information such as acronyms, phone numbers, and email addresses.

The Standard Analyzer may not be the best choice for the formatted, coded information that you are indexing, as it may or may not split the text at a comma depending on the text that follows.

If you have not seen it before, here is the source for a small command-line example that displays the tokens produced by a selected analyzer and a given input string. For example:

Standard Analyzer

C:\>ADemo 1 "DNE,APLU,GB11/0290"
[1]: "dne"
[2]: "aplu,gb11/0290"


Whitespace Analyzer

C:\>ADemo 2 "DNE,APLU,GB11/0290"
[1]: "DNE,APLU,GB11/0290"


------------------------------------------------------------------

static void Main(string[] args)
{
   if (args.Length < 2)
      return;

   // Parse the analyzer selector once and reuse the result below.
   int selector;
   if ( ! Int32.TryParse( args[0], out selector ) )
       return;

   Lucene.Net.Analysis.Analyzer analyzer = null;

   switch( selector )
   {
      case 4:
         analyzer = new Lucene.Net.Analysis.SimpleAnalyzer();
         break;

      case 3:
         analyzer = new Lucene.Net.Analysis.StopAnalyzer();
         break;

      case 2:
         analyzer = new Lucene.Net.Analysis.WhitespaceAnalyzer();
         break;

      case 1:
      default:
         analyzer = new Lucene.Net.Analysis.Standard.StandardAnalyzer();
         break;

   }
			
   Lucene.Net.Analysis.TokenStream tks = analyzer.TokenStream(
       new System.IO.StringReader(args[1]) );
				
   int tkNum = 1;
   Lucene.Net.Analysis.Token curToken = tks.Next();
   while(curToken != null)
   {
      System.Console.WriteLine("[{0}]: \"{1}\"",
         tkNum++, curToken.TermText() );

      curToken = tks.Next();
   }
}

------------------------------------------------------------------

- Neal
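
One possible workaround, following the observation earlier in the thread that a space after each comma yields separate terms, is to normalise the text before it reaches the analyzer. The sketch below is only an illustration under that assumption: DelimitedTextNormalizer and its Normalize method are hypothetical names, not part of Lucene.Net, and the code uses only the .NET base class library.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of Lucene.Net.
public static class DelimitedTextNormalizer
{
    // Matches a comma that is immediately followed by a non-whitespace character.
    private static readonly Regex TightComma = new Regex(@",(?=\S)");

    // Inserts a space after every comma that does not already have
    // whitespace after it, e.g. "A,B" becomes "A, B".
    public static string Normalize(string text)
    {
        if (text == null)
            throw new ArgumentNullException("text");

        return TightComma.Replace(text, ", ");
    }
}

public static class Program
{
    public static void Main()
    {
        // One of the sample lines from the original question.
        Console.WriteLine(DelimitedTextNormalizer.Normalize("DNE,MAEU,547712"));
        // prints: DNE, MAEU, 547712
    }
}
```

If this approach were used, the normalised string would be what gets passed to the "content" field at indexing time, and query text containing commas would presumably need the same treatment so that both sides are analyzed consistently.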


Re: [Lucene.Net] Indexing Oddity

Posted by Tim Haughton <ti...@gmail.com>.

internal static void AddToContentIndex(EDMDocument document, string fullText)
{
    lock (contentMutex)
    {
        IndexWriter writer = null;
        try
        {
            EnsureContentIndexIsUnlocked();

            // Add content
            var contentIndexFolder = new FileInfo(App.ContentIndexFolder);
            writer = new IndexWriter(contentIndexFolder, new StandardAnalyzer(), false);
            writer.SetUseCompoundFile(true);

            var contentDoc = new Document();
            contentDoc.Add(new Field("content", fullText,
                                     Field.Store.NO, Field.Index.TOKENIZED));
            contentDoc.Add(new Field("documentID", document.DocumentID,
                                     Field.Store.YES, Field.Index.UN_TOKENIZED));

            writer.AddDocument(contentDoc);
            writer.Optimize();
        }
        catch (Exception exception)
        {
            log.Error("Problem adding document to content index.", exception);
        }
        finally
        {
            if (writer != null)
            {
                writer.Close();
            }
        }
    }
}

Cheers,

Tim


On 26 September 2011 13:37, Itamar Syn-Hershko <it...@code972.com> wrote:

> No, you are probably using KeywordAnalyzer
>
> What is your indexing code?

Re: [Lucene.Net] Indexing Oddity

Posted by Itamar Syn-Hershko <it...@code972.com>.

No, you are probably using KeywordAnalyzer

What is your indexing code?
