Posted to user@lucenenet.apache.org by Jennifer Wilson <je...@researchintegrations.com> on 2011/06/23 19:04:21 UTC

[Lucene.Net] Advice for troubleshooting inconsistent number of terms added to "contents" field?

Hi all,

I'm writing to ask for advice about troubleshooting what seems like a
strange error.  When I index my test set of files, the number of terms
added to the Lucene index in my "contents" field changes each time I
build a fresh index.  Investigating further, this discrepancy in term
counts appears to happen because only SOME of my files have their
"contents" field populated during indexing.  Sometimes, out of the 80
files, it appears to add the "contents" for only 4 of the files,
sometimes for 7, sometimes 15, and once none of the files had their
"contents" added.  However, the other fields such as ID, filename,
filepath, etc. are correctly added for ALL files EVERY TIME... it is
only the "contents" field that is experiencing this problem.

(Note: To determine this, I clear the index by creating a new index over
the old one and committing the changes.  I visually verify that the index
is cleared using Luke.  I then run the indexing on my 80 files, re-open
Luke, and view the term count for the "contents" field.  I then pick a
word that I know exists in every file, like "the", and search on that
word [contents:the].  The number of resulting documents is what I assume
to be the number of files that actually had their contents field added.)
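
For what it's worth, the same check can be scripted rather than eyeballed
in Luke.  Below is a rough sketch against the Lucene.Net 2.9-era API; the
field name and query term come from this thread, while the index path is a
made-up placeholder standing in for the PATHINDEX constant:

```csharp
// Sketch only: count how many documents contain "the" in "contents",
// versus the total number of documents in the index.
using System;
using Lucene.Net.Index;
using Lucene.Net.Store;

class ContentsCheck
{
    static void Main()
    {
        // Hypothetical path; substitute the real index directory.
        string indexPath = @"C:\path\to\index";

        Lucene.Net.Store.Directory dir =
            FSDirectory.Open(new System.IO.DirectoryInfo(indexPath));
        IndexReader reader = IndexReader.Open(dir, true); // read-only

        int total = reader.NumDocs();
        int withContents = reader.DocFreq(new Term("contents", "the"));
        Console.WriteLine(withContents + " of " + total +
            " documents match contents:the");

        reader.Close();
    }
}
```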


So, I'm really baffled and can't figure out where the process is going
wrong.  Can anyone offer any advice on troubleshooting this error?


Below is some information about the specifics of my project that may shed
some light...

I am using Visual Studio 2008 (C#) and wrote my indexing code in a
Windows Forms project.  I built the DLLs from Apache-Lucene.Net-2.9.2-src.
The files I am indexing are .aspx files, so I am using
Lucene.Net.Demo.Html.HTMLParser to remove the tags within each file
before sending it into my analyzer.

I've provided some snippets of code below to show some (possibly?)
relevant details...


The code for the CreateIndexWriter() method:
---------------------------------------------------------------------------
        public void CreateIndexWriter()
        {
            // Create the Lucene IndexWriter.
            bool createNewIndex = true;

            // Assign directory info to the dirInfo variable.
            DirectoryInfo dirInfo = new DirectoryInfo(PATHINDEX);

            // If the index directory already exists, set createNewIndex
            // to false so that the IndexWriter appends to the existing
            // index.
            if (Directory.Exists(PATHINDEX))
            {
                createNewIndex = false;
            }

            Analyzer analyzer = new MyAnalyzer();

            // Create the IndexWriter.
            writer = new IndexWriter(FSDirectory.Open(dirInfo), analyzer,
                createNewIndex, IndexWriter.MaxFieldLength.UNLIMITED);
        }
===========================================================================
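
One detail the snippets do not show is how the writer is flushed and
released.  If the writer is never committed or closed, documents buffered
in memory will not be visible to Luke, which can mimic "missing" fields.
A sketch of a typical shutdown path under the 2.9-era API (the method name
CloseIndexWriter is made up for illustration):

```csharp
        public void CloseIndexWriter()
        {
            // Make buffered documents visible to readers (e.g. Luke),
            // then release the writer and its lock file.
            writer.Commit();
            writer.Close();
        }
```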


Class definition of my custom analyzer:
---------------------------------------------------------------------------
    public class MyAnalyzer : Analyzer
    {
        public override TokenStream TokenStream(string fieldName,
            System.IO.TextReader reader)
        {
            // Create the tokenizer.
            TokenStream result = new StandardTokenizer(reader);

            // Add the filters: first normalize the StandardTokenizer
            // output, then lower-case everything.
            result = new StandardFilter(result);
            result = new LowerCaseFilter(result);

            // Return the built token stream.
            return result;
        }
    }
===========================================================================
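
To rule the analyzer in or out as the culprit, it can help to run a known
string through it and print the tokens it produces.  A rough debugging
sketch, assuming the Lucene.Net 2.9-era attribute API (TermAttribute,
IncrementToken); the class and method names are made up for illustration:

```csharp
// Debugging sketch: dump the tokens MyAnalyzer produces for sample text.
using System;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Tokenattributes;

class AnalyzerProbe
{
    static void DumpTokens(Analyzer analyzer, string text)
    {
        TokenStream ts =
            analyzer.TokenStream("contents", new StringReader(text));
        TermAttribute termAtt =
            (TermAttribute)ts.AddAttribute(typeof(TermAttribute));

        // Print each token on its own line.
        while (ts.IncrementToken())
        {
            Console.WriteLine(termAtt.Term());
        }
        ts.Close();
    }
}
```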


The indexFile method, which calls BuildDocument and then adds the
resulting Document to the index:
---------------------------------------------------------------------------
        private void indexFile(FileInfo f)
        {
            // Build the Lucene Document record for the file.
            Document doc = BuildDocument(f);

            // Add the Lucene Document to the index.
            writer.AddDocument(doc);
        }
===========================================================================


The portion of the BuildDocument method that adds the "contents" field
(taken directly from the Apache-Lucene.Net-2.9.2-src.src.Demo.IndexHtml
example):
---------------------------------------------------------------------------
protected Document BuildDocument(FileInfo f)
{
...

     System.IO.FileStream fis = new System.IO.FileStream(f.FullName,
         System.IO.FileMode.Open, System.IO.FileAccess.Read);
     HTMLParser parser = new HTMLParser(fis);

     // Add the main text of the file as a field named "contents".  Use a
     // field that is indexed (i.e. searchable) and tokenized with word
     // position information preserved, but do not store the original text.
     doc.Add(new Field("contents", parser.GetReader()));

...
===========================================================================
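
Because the Field(string, TextReader) overload is consumed lazily when
AddDocument runs, one way to localize a problem like this is a defensive
variant that materializes the parser output into a string first.  This is
a diagnostic sketch, not the fix the thread settled on, and the warning
line is purely illustrative:

```csharp
     // Diagnostic variant: read the parser output eagerly so an empty
     // extraction is visible per file; still index with positions, and
     // still avoid storing the original text.
     System.IO.FileStream fis = new System.IO.FileStream(f.FullName,
         System.IO.FileMode.Open, System.IO.FileAccess.Read);
     HTMLParser parser = new HTMLParser(fis);

     string contents = parser.GetReader().ReadToEnd();
     if (contents.Length == 0)
     {
         Console.WriteLine("WARNING: no text extracted from " + f.FullName);
     }

     doc.Add(new Field("contents", contents,
         Field.Store.NO, Field.Index.ANALYZED));
     fis.Close();
```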


Any advice would be very welcome!

Thank you in advance,
Jennifer

RE: [Lucene.Net] Advice for troubleshooting inconsistent number of terms added to "contents" field?

Posted by Jennifer Wilson <je...@researchintegrations.com>.
Hi Digy,

Your suggestion worked like a charm - it fixed the problem, and the
indexing code is now working great (and giving me the same results
every time)!

Thank you so very much!!
Jennifer



RE: [Lucene.Net] Advice for troubleshooting inconsistent number of terms added to "contents" field?

Posted by Digy <di...@gmail.com>.
Although I have been a Lucene.Net user for many years, I have never used
or tested the HTMLParser in the demo.

When I look at its code, I see a lot of threading-related code, so there
might be a synchronization bug.

I would recommend grabbing HTMLStripCharFilter.cs from
https://github.com/synhershko/Lucene.Net.Contrib/blob/master/Lucene.Net.Contrib/Analysis/HTMLStripCharFilter.cs
(ported to C# by synhershko)

and using an analyzer something like this:

public class HtmlStripAnalyzer : Analyzer
{
      public override TokenStream TokenStream(string fieldName,
          TextReader reader)
      {
            return new LowerCaseFilter(new StandardTokenizer(
                new HTMLStripCharFilter(
                    Lucene.Net.Analysis.CharReader.Get(reader))));
      }
}

DIGY
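
For anyone wiring this in, a minimal sketch of how the suggested analyzer
might replace MyAnalyzer in the earlier CreateIndexWriter() snippet; the
writer field, PATHINDEX constant, and createNewIndex flag are assumed from
the code shown earlier in the thread:

```csharp
// Swap the custom analyzer for the HTML-stripping one.  Because
// HTMLStripCharFilter removes tags inside the analysis chain itself, the
// separate HTMLParser step in BuildDocument is no longer needed.
Analyzer analyzer = new HtmlStripAnalyzer();
writer = new IndexWriter(
    FSDirectory.Open(new DirectoryInfo(PATHINDEX)),
    analyzer, createNewIndex, IndexWriter.MaxFieldLength.UNLIMITED);

// In BuildDocument, the raw file text can then be fed directly:
doc.Add(new Field("contents", new StreamReader(f.FullName)));
```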

 
