Posted to solr-user@lucene.apache.org by Peter Wolanin <pe...@acquia.com> on 2009/11/10 22:06:52 UTC

any docs on solr.EdgeNGramFilterFactory?

This fairly recent blog post:

http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/

describes the use of the solr.EdgeNGramFilterFactory as the tokenizer
for the index.  I don't see any mention of that tokenizer on the Solr
wiki - is it just waiting to be added, or is there any other
documentation in addition to the blog post?  In particular, there was
a thread last year about using an N-gram tokenizer to enable
reasonable (if not ideal) searching of CJK text, so I'd be curious to
know how people are configuring their schema (with this tokenizer?)
for that use case.
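For reference, the setup in that blog post boils down to a fieldType along these lines (a minimal sketch for a Solr 1.4-era schema.xml; the fieldType name and gram sizes here are illustrative, not taken from the post):

```xml
<!-- Illustrative autosuggest field type: edge n-grams at index time,
     plain lowercased keyword at query time. -->
<fieldType name="autocomplete" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Note that EdgeNGramFilterFactory is a token filter, so it sits behind a tokenizer (here KeywordTokenizerFactory) rather than acting as the tokenizer itself; the query-side analyzer omits the n-gram step so a typed prefix matches the indexed grams directly.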

Thanks,

Peter

-- 
Peter M. Wolanin, Ph.D.
Momentum Specialist,  Acquia. Inc.
peter.wolanin@acquia.com

Re: any docs on solr.EdgeNGramFilterFactory?

Posted by Robert Muir <rc...@gmail.com>.
Ah, thanks - I'll tentatively set one in the future, but definitely not 2.9.x.

More just to show you the idea: you can do different things depending on
different runs of writing systems in the text. But it doesn't solve everything:
you only know it's Latin script, not English, so you can't safely do anything
automatic like stemming.

Say your content is only Chinese and English: the analyzer won't know from the
Unicode that your Latin-script text is English rather than, say, French, so it
won't stem it - but it will lowercase it. It won't know whether your ideographs
are Chinese or Japanese, but it will use n-gram tokenization; you get the drift.

In that implementation, the script code is put in the token flags, so downstream
you could do something like stemming if you happen to know more than is evident
from the Unicode.

On Fri, Nov 13, 2009 at 6:23 PM, Peter Wolanin <pe...@acquia.com> wrote:

> Thanks for the link - there doesn't seem to be a fix version specified,
> so I guess this won't officially ship with Lucene 2.9?
>
> -Peter



-- 
Robert Muir
rcmuir@gmail.com

Re: any docs on solr.EdgeNGramFilterFactory?

Posted by Peter Wolanin <pe...@acquia.com>.
Thanks for the link - there doesn't seem to be a fix version specified,
so I guess this won't officially ship with Lucene 2.9?

-Peter

On Wed, Nov 11, 2009 at 10:36 PM, Robert Muir <rc...@gmail.com> wrote:
> Peter, here is a project that does this:
> http://issues.apache.org/jira/browse/LUCENE-1488



-- 
Peter M. Wolanin, Ph.D.
Momentum Specialist,  Acquia. Inc.
peter.wolanin@acquia.com

Re: any docs on solr.EdgeNGramFilterFactory?

Posted by Robert Muir <rc...@gmail.com>.
Peter, here is a project that does this:
http://issues.apache.org/jira/browse/LUCENE-1488


> That's kind of interesting - in general, can I build a custom tokenizer
> from existing tokenizers that treats different parts of the input
> differently based on the Unicode range of the characters?  E.g. use a
> Porter stemmer for stretches of Latin text and n-grams or something
> else for CJK?
>
> -Peter




-- 
Robert Muir
rcmuir@gmail.com

Re: any docs on solr.EdgeNGramFilterFactory?

Posted by Peter Wolanin <pe...@acquia.com>.
It looks like the CJK one actually does 2-grams, plus a little
separate processing on Latin text.

That's kind of interesting - in general, can I build a custom tokenizer
from existing tokenizers that treats different parts of the input
differently based on the Unicode range of the characters?  E.g. use a
Porter stemmer for stretches of Latin text and n-grams or something
else for CJK?

-Peter

On Tue, Nov 10, 2009 at 9:21 PM, Otis Gospodnetic
<ot...@yahoo.com> wrote:
> Yes, that's the n-gram one.  I believe the existing CJK one in Lucene is really just an n-gram tokenizer, so no different than the normal n-gram tokenizer.



-- 
Peter M. Wolanin, Ph.D.
Momentum Specialist,  Acquia. Inc.
peter.wolanin@acquia.com

Re: any docs on solr.EdgeNGramFilterFactory?

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Yes, that's the n-gram one.  I believe the existing CJK one in Lucene is really just an n-gram tokenizer, so no different than the normal n-gram tokenizer.

Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR





Re: any docs on solr.EdgeNGramFilterFactory?

Posted by Peter Wolanin <pe...@acquia.com>.
So, this is the normal N-gram one?  NGramTokenizerFactory

Digging deeper - there are actually CJK and Chinese tokenizers in the
Solr codebase:

http://lucene.apache.org/solr/api/org/apache/solr/analysis/CJKTokenizerFactory.html
http://lucene.apache.org/solr/api/org/apache/solr/analysis/ChineseTokenizerFactory.html

The CJK one uses the Lucene CJKTokenizer
http://lucene.apache.org/java/2_9_1/api/contrib-analyzers/org/apache/lucene/analysis/cjk/CJKTokenizer.html

and there even seems to be another one that no one has wrapped into Solr:
http://lucene.apache.org/java/2_9_1/api/contrib-smartcn/org/apache/lucene/analysis/cn/smart/package-summary.html

So it seems like the existing options are a little better than I thought,
though it would be nice to have some docs on properly configuring
these.
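A minimal sketch of wiring the first of these factories into a schema.xml (the fieldType name is illustrative):

```xml
<!-- Illustrative CJK field type. The underlying CJKTokenizer emits
     overlapping character bigrams for runs of CJK characters and
     ordinary word tokens for Latin text. -->
<fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.CJKTokenizerFactory"/>
  </analyzer>
</fieldType>
```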

-Peter

On Tue, Nov 10, 2009 at 6:05 PM, Otis Gospodnetic
<ot...@yahoo.com> wrote:
> Peter,
>
> For CJK and n-grams, I think you don't want the *Edge* n-grams, but just n-grams.
> Before you take the n-gram route, you may want to look at the smart Chinese analyzer in Lucene contrib (I think it works only for Simplified Chinese) and Sen (on java.net).  I also spotted a Korean analyzer in the wild a few months back.



-- 
Peter M. Wolanin, Ph.D.
Momentum Specialist,  Acquia. Inc.
peter.wolanin@acquia.com

Re: any docs on solr.EdgeNGramFilterFactory?

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Peter,

For CJK and n-grams, I think you don't want the *Edge* n-grams, but just n-grams.
Before you take the n-gram route, you may want to look at the smart Chinese analyzer in Lucene contrib (I think it works only for Simplified Chinese) and Sen (on java.net).  I also spotted a Korean analyzer in the wild a few months back.
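If you do take the n-gram route, a sketch of what it might look like in schema.xml (the fieldType name and gram sizes are illustrative):

```xml
<!-- Illustrative plain n-gram field type for CJK text: emits every
     substring of length 1-2, not just the leading prefixes that the
     Edge variant produces. -->
<fieldType name="text_ngram" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.NGramTokenizerFactory" minGramSize="1" maxGramSize="2"/>
  </analyzer>
</fieldType>
```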

Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR


