Posted to solr-user@lucene.apache.org by Max Lynch <ih...@gmail.com> on 2010/10/04 22:05:31 UTC
Using Solr Analyzers in Lucene
Hi,
I asked this question a month ago on lucene-user and was referred here.
I have content being analyzed in Solr using these tokenizers and filters:
<fieldType name="text_standard" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="0" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"
            protected="protwords.txt"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="0" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"
            protected="protwords.txt"/>
  </analyzer>
</fieldType>
Basically I want to be able to search against this index in Lucene from one
of my background search applications.
My main reason for using Lucene rather than Solr here is that I use the
highlighter to keep track of exactly which terms were found, which I feed
into my own scoring system, and I always collect the whole set of found
documents. I've messed around with boosts, but they weren't fine-grained
enough, and I wasn't able to effectively create a score threshold (would
creating my own scorer be a better idea?)
Is it possible to use this analyzer from Lucene, or at least re-create it in
code?
Thanks.
RE: Using Solr Analyzers in Lucene
Posted by Mathias Walter <ma...@gmx.net>.
Hi Max,
Why don't you use WordDelimiterFilterFactory directly? I'm doing the same
thing inside my own analyzer:
// Same options as the WordDelimiterFilterFactory entry in schema.xml.
final Map<String, String> args = new HashMap<String, String>();
args.put("generateWordParts", "1");
args.put("generateNumberParts", "1");
args.put("catenateWords", "0");
args.put("catenateNumbers", "0");
args.put("catenateAll", "0");
args.put("splitOnCaseChange", "1");
args.put("splitOnNumerics", "1");
args.put("preserveOriginal", "1");
args.put("stemEnglishPossessive", "0");
args.put("language", "English");

// Initialize the factory with the options, then wrap an existing TokenStream.
wordDelimiter = new WordDelimiterFilterFactory();
wordDelimiter.init(args);
stream = wordDelimiter.create(stream);
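Putting the whole chain together, the text_standard field type from the original message could be approximated as a standalone Lucene Analyzer. This is only a sketch against Lucene/Solr 1.4-era APIs (class names and constructors should be checked against your version, and protwords.txt handling for the stemmer is omitted):

```java
import java.io.Reader;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.snowball.SnowballFilter;
import org.apache.solr.analysis.WordDelimiterFilterFactory;

// Rough equivalent of the text_standard field type (index and query
// analyzers are identical in the schema, so one Analyzer covers both).
public class TextStandardAnalyzer extends Analyzer {
    private final WordDelimiterFilterFactory wdf = new WordDelimiterFilterFactory();

    public TextStandardAnalyzer() {
        Map<String, String> args = new HashMap<String, String>();
        args.put("generateWordParts", "0");
        args.put("generateNumberParts", "1");
        args.put("catenateWords", "1");
        args.put("catenateNumbers", "1");
        args.put("catenateAll", "0");
        args.put("splitOnCaseChange", "1");
        wdf.init(args);
    }

    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream stream = new WhitespaceTokenizer(reader);
        stream = wdf.create(stream);                  // WordDelimiterFilterFactory
        stream = new LowerCaseFilter(stream);         // LowerCaseFilterFactory
        return new SnowballFilter(stream, "English"); // SnowballPorterFilterFactory
    }
}
```

The same Analyzer instance can then be handed to IndexWriter and QueryParser so indexing and searching stay in sync.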
--
Kind regards,
Mathias
> -----Original Message-----
> From: Max Lynch [mailto:ihasmax@gmail.com]
> Sent: Tuesday, October 05, 2010 1:03 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Using Solr Analyzers in Lucene
Re: Using Solr Analyzers in Lucene
Posted by Max Lynch <ih...@gmail.com>.
I guess I missed the init() method. I was looking at the factory and
thought I saw config-loading code (like getInt), which I assumed meant it
needed to have schema.xml available.
Thanks!
-Max
On Tue, Oct 5, 2010 at 2:36 PM, Mathias Walter <ma...@gmx.net> wrote:
Re: Using Solr Analyzers in Lucene
Posted by Max Lynch <ih...@gmail.com>.
I have made progress on this by writing my own Analyzer. I basically added
the TokenFilters that sit under each of the Solr factory classes. I had to
copy and paste the WordDelimiterFilter because, of course, it was
package-private.
On Mon, Oct 4, 2010 at 3:05 PM, Max Lynch <ih...@gmail.com> wrote: