Posted to java-user@lucene.apache.org by Stephen Thomas <st...@cs.queensu.ca> on 2011/11/29 17:19:38 UTC

Custom Filter for Splitting CamelCase?

List,

I have written my own CustomAnalyzer, as follows:

public TokenStream tokenStream(String fieldName, Reader reader) {

    // TODO: add calls to RemovePunctuation and SplitIdentifiers here

    // Tokenize on non-letters and convert each token to lower case
    TokenStream out = new LowerCaseTokenizer(reader);

    if (this.doStopping) {
        out = new StopFilter(true, out, customStopSet);
    }

    if (this.doStemming) {
        out = new PorterStemFilter(out);
    }

    return out;
}



What I need to do is write two custom filters that do the following:

- RemovePunctuation() replaces every character other than letters and
digits ([a-zA-Z0-9]) with a space, preserving case. E.g.,

"foo=bar*45;" ==> "foo bar 45"
"fooBar" ==> "fooBar"
"\"sthomas@cs.queensu.ca\"" ==> "sthomas cs queensu ca"

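A minimal plain-Java sketch of the punctuation-removal step, outside of any
Lucene TokenStream. It assumes digits are kept (as the first example above
shows) and that each run of other characters collapses to a single space with
leading and trailing spaces trimmed. The class and method names are
hypothetical.

```java
import java.util.regex.Pattern;

class PunctuationStripper {
    // Any run of characters that are not ASCII letters or digits.
    private static final Pattern NON_ALNUM = Pattern.compile("[^a-zA-Z0-9]+");

    /** Replaces each run of non-alphanumeric characters with one space; case is preserved. */
    static String strip(String input) {
        return NON_ALNUM.matcher(input).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        System.out.println(strip("foo=bar*45;"));
        System.out.println(strip("fooBar"));
        System.out.println(strip("\"sthomas@cs.queensu.ca\""));
    }
}
```

Inside an analyzer the same replacement would have to run on the character
stream or on each token's text; this sketch only shows the string transformation.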

- SplitIdentifiers() breaks up words based on camelCase notation:

"fooBar" ==> "foo Bar"
"ABCCompany" ==> "ABC Company"

(I have the regex for this.)
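
The regex mentioned above is not shown in the post. As an assumption, one
commonly used pattern splits at a lowercase-to-uppercase transition and before
the last capital of an acronym that is followed by a lowercase letter. The
class and method names below are hypothetical, and this is plain Java rather
than a Lucene filter.

```java
import java.util.regex.Pattern;

class CamelCaseSplitter {
    // Zero-width boundaries: between "foo" and "Bar", or between "ABC" and "Company".
    private static final Pattern BOUNDARY = Pattern.compile(
            "(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])");

    /** Inserts a space at every camelCase boundary; characters are otherwise unchanged. */
    static String split(String identifier) {
        return BOUNDARY.matcher(identifier).replaceAll(" ");
    }

    public static void main(String[] args) {
        System.out.println(split("fooBar"));
        System.out.println(split("ABCCompany"));
    }
}
```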

Note this step must be performed before LowerCaseTokenizer, because we
need case information to do the splitting.
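
To illustrate why the order matters, here is a small plain-Java check (not
Lucene code) using an assumed, simplified boundary pattern: once the text has
been lowercased, there is no boundary left to split on. The class and method
names are hypothetical.

```java
import java.util.Locale;

class OrderDemo {
    // Assumed, simplified boundary: a lowercase letter followed by an uppercase letter.
    private static final String BOUNDARY = "(?<=[a-z])(?=[A-Z])";

    /** Splits while case information is still present, then lowercases. */
    static String splitThenLower(String s) {
        return s.replaceAll(BOUNDARY, " ").toLowerCase(Locale.ROOT);
    }

    /** Lowercases first, so the split pattern can no longer match. */
    static String lowerThenSplit(String s) {
        return s.toLowerCase(Locale.ROOT).replaceAll(BOUNDARY, " ");
    }

    public static void main(String[] args) {
        System.out.println(splitThenLower("fooBar"));
        System.out.println(lowerThenSplit("fooBar"));
    }
}
```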


How can I write custom filters, and how do I call them before
LowerCaseTokenizer()?


Thanks in advance,
Steve

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


RE: Custom Filter for Splitting CamelCase?

Posted by Uwe Schindler <uw...@thetaphi.de>.
Hi,

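Be sure to use the same Solr version as your Lucene version (if >= 3.1); an
AbstractMethodError on incrementToken() usually means the Solr and Lucene jars
on the classpath do not match. Here is example code from a test case: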

    WordDelimiterFilterFactory fact = new WordDelimiterFilterFactory();
    // we don’t need this if we don’t load external exclusion files:
    // ResourceLoader loader = new SolrResourceLoader(null, null);
    Map<String,String> args = new HashMap<String,String>();
    args.put("generateWordParts", "1");
    args.put("generateNumberParts", "1");
    args.put("catenateWords", "1");
    args.put("catenateNumbers", "1");
    args.put("catenateAll", "0");
    args.put("splitOnCaseChange", "1");
    fact.init(args);
    // fact.inform(loader);
    
    TokenStream ts = fact.create(new LowerCaseTokenizer(reader));


For all available args parameters, see:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de






Re: Custom Filter for Splitting CamelCase?

Posted by Stephen Thomas <st...@cs.queensu.ca>.
How do you use the WordDelimiterFilterFactory()? I tried the following code:


TokenStream out = new LowerCaseTokenizer(reader);
WordDelimiterFilterFactory wdf = new WordDelimiterFilterFactory();
out = wdf.create(out);
...

But I am getting a runtime error:

Exception in thread "main" java.lang.AbstractMethodError:
org.apache.lucene.analysis.TokenStream.incrementToken()Z
	at org.apache.lucene.analysis.StopFilter.incrementToken(StopFilter.java:141)
	at org.apache.lucene.analysis.PorterStemFilter.incrementToken(PorterStemFilter.java:54)
        ...

I can't instantiate WordDelimiterFilter directly, because the class is
not public (it is package-private).

Any ideas?

Thanks,
Steve







RE: Custom Filter for Splitting CamelCase?

Posted by Uwe Schindler <uw...@thetaphi.de>.
Hi,

There is a WordDelimiterFilter in Solr that was also ported to the Lucene
analysis module in Lucene trunk (4.0). In 3.x you can still add solr.jar to
your classpath and use WordDelimiterFilterFactory to produce one
(WordDelimiterFilter itself is package-private).

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de



