Posted to user@lucenenet.apache.org by Trevor Watson <tw...@datassimilate.com> on 2011/06/16 17:31:04 UTC

[Lucene.Net] Analyzer Question for Lucene.Net

I'm trying to get Lucene.Net to create terms the way that we want it to 
happen.  I'm currently running Lucene.Net 2.9.2.2.

Basically, we want the StandardAnalyzer with the exception that we want 
terms to be divided at a period as well.  The StandardAnalyzer seems to 
only split the two words into terms if the period is followed by white-space.

So if we index autoexec.bat it should produce [autoexec] and [bat], not 
[autoexec.bat].

I was trying to create my own Analyzer that would do it, but could not 
figure out how.


So far I have a very basic analyzer that uses the StandardTokenizer and 
2 filters.

// --------- code block ----------------------

class ExtendedStandardAnalyzer : Analyzer
{
    public override TokenStream TokenStream(string fieldName, System.IO.TextReader reader)
    {
        TokenStream result = new StandardTokenizer(Lucene.Net.Util.Version.LUCENE_29, reader);
        // TokenStream result = new LetterTokenizer(reader); // doesn't work because we want numbers

        result = new StandardFilter(result);
        result = new LowerCaseFilter(result);

        return result;
    }
}
// --------- end code block ------------------


Thanks in advance.
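[Editor's note: the intended splitting can be sketched without Lucene at all. The helper below is hypothetical (not part of any Lucene.Net API); it just mirrors what the desired analyzer should emit for a single raw term, matching the [autoexec]/[bat] example above.]

// --------- code block ----------------------
```csharp
using System;
using System.Linq;

static class PeriodSplitter
{
    // Split a raw term at '.' and lowercase the parts, mirroring what the
    // desired analyzer should emit: "AutoExec.Bat" -> ["autoexec", "bat"].
    public static string[] Split(string term)
    {
        return term
            .Split(new[] { '.' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(t => t.ToLowerInvariant())
            .ToArray();
    }
}
```
// --------- end code block ------------------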

RE: [Lucene.Net] Analyzer Question for Lucene.Net

Posted by Digy <di...@gmail.com>.
Take a look at UnaccentedWordAnalyzer in
https://svn.apache.org/repos/asf/incubator/lucene.net/branches/Lucene.Net_2_9_4g/src/contrib/Core/Analysis/Ext/Analysis.Ext.cs

If you want, you can remove the "ASCIIFoldingFilter" from the chain.
DIGY



RE: [Lucene.Net] Analyzer Question for Lucene.Net

Posted by Franklin Simmons <fs...@sccmediaserver.com>.
Your workaround will break acronyms, which may or may not be important to your effort. Consider extending TokenFilter to decompose tokens of type HOST instead; a performance gain should result and it won't break acronyms. Just be sure the tokens resulting from the decomposition are correct!

E.g.:

public class HostFilter : TokenFilter
{
    private readonly Queue<Token> fifo = new Queue<Token>();

    public HostFilter(TokenStream input) : base(input)
    {
    }

    public override Token Next()
    {
        Token token = null;
        if (fifo.Count > 0)
            token = fifo.Dequeue();
        else
        {
            token = input.Next();
            if (token != null)
            {
                if (token.Type() == "HOST")
                {
                    // The UnHost method must push onto fifo new tokens with the correct length,
                    // start and end offsets, position increment etc.
                    UnHost(token);
                    token = fifo.Dequeue();
                }
            }
        }
        return token;
    }
}
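[Editor's note: the offset bookkeeping that UnHost has to get right can be sketched independently of Lucene. The helper below is hypothetical (UnHost itself is left to the implementer above); it shows the arithmetic for giving each sub-token of a HOST token its position in the original text.]

// --------- code block ----------------------
```csharp
using System;
using System.Collections.Generic;

static class HostSplitter
{
    // For a HOST token like "autoexec.bat" that starts at offset 10 in the
    // source text, the sub-tokens must keep their original positions:
    // ("autoexec", 10, 18) and ("bat", 19, 22).
    public static List<(string Text, int Start, int End)> Split(string host, int startOffset)
    {
        var parts = new List<(string, int, int)>();
        int pos = 0;
        foreach (var piece in host.Split('.'))
        {
            if (piece.Length > 0)
                parts.Add((piece, startOffset + pos, startOffset + pos + piece.Length));
            pos += piece.Length + 1; // +1 skips the '.' separator
        }
        return parts;
    }
}
```
// --------- end code block ------------------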



Re: [Lucene.Net] Analyzer Question for Lucene.Net

Posted by Trevor Watson <tw...@datassimilate.com>.
I figured out a work-around in the custom analyzer by doing the following:

// --------------- code block ---------------------
    public override TokenStream TokenStream(string fieldName, System.IO.TextReader reader)
    {
        TextReader newReader = new StringReader(reader.ReadToEnd().Replace(".", ". "));
        TokenStream result = new StandardTokenizer(Lucene.Net.Util.Version.LUCENE_29, newReader);

// -------------end code block -----------------


It seems to work this way.  Thanks again.
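[Editor's note: one caveat worth knowing about this workaround. Because the text is rewritten before tokenizing, every '.' adds one character, so all character offsets after it shift right; stored term offsets then no longer line up with the original text (relevant for highlighting). A minimal illustration of the drift:]

// --------- code block ----------------------
```csharp
using System;

static class ReplaceDemo
{
    // The workaround's pre-tokenization rewrite: each '.' becomes ". ",
    // growing the text by one character per period.
    public static string Expand(string text) => text.Replace(".", ". ");
}
```
// --------- end code block ------------------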



