Posted to solr-user@lucene.apache.org by Joel Bernstein <jo...@gmail.com> on 2017/06/20 15:52:21 UTC

How are people using the ICUTokenizer?

It seems that there are some powerful capabilities in the ICUTokenizer. I
was wondering how the community is making use of it.

Does anyone have experience working with the ICUTokenizer that they can
share?


Joel Bernstein
http://joelsolr.blogspot.com/

Re: How are people using the ICUTokenizer?

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
I used it in a demo where I searched for Thai words by their approximate
English sound-equivalents:
https://github.com/arafalov/solr-thai-test/blob/master/collection1/conf/schema.xml#L34

I thought that was pretty cool and unexpectedly powerful :-)
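
Roughly, the field type boils down to something like this (a simplified
sketch, not the exact file; the field-type name and filter choices here are
illustrative):

```xml
<!-- Illustrative ICU analysis chain; names are not copied from the linked schema -->
<fieldType name="text_icu" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Unicode normalization before tokenizing -->
    <charFilter class="solr.ICUNormalizer2CharFilterFactory"/>
    <!-- UAX#29 word breaking with ICU's per-script rules and dictionaries -->
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <!-- Case folding, accent folding, and normalization in one step -->
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>
```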

Regards,
   Alex.
----
http://www.solr-start.com/ - Resources for Solr users, new and experienced


Re: How are people using the ICUTokenizer?

Posted by Joel Bernstein <jo...@gmail.com>.
What got me interested is that under the covers the ICUTokenizer uses ICU's
BreakIterator:
http://icu-project.org/apiref/icu4j/com/ibm/icu/text/BreakIterator.html

Looks like we can get sentences and titles fairly easily, and paragraphs
with some extra work.

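For example, sentence splitting with a BreakIterator looks roughly like
this. The sketch below uses the JDK's java.text.BreakIterator, which shares
the same first()/next() contract as ICU4J's com.ibm.icu.text.BreakIterator
(ICU4J just ships richer locale rules):

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SentenceSplit {
    /**
     * Split text into sentences using BreakIterator boundaries.
     * ICU4J's com.ibm.icu.text.BreakIterator exposes the same
     * first()/next() API, so this sketch carries over directly.
     */
    static List<String> sentences(String text, Locale locale) {
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String s = text.substring(start, end).trim();
            if (!s.isEmpty()) {
                out.add(s);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints each detected sentence on its own line
        for (String s : sentences("Sentence one. Sentence two! Is this three?", Locale.US)) {
            System.out.println(s);
        }
    }
}
```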
Joel Bernstein
http://joelsolr.blogspot.com/

RE: How are people using the ICUTokenizer?

Posted by "Allison, Timothy B." <ta...@mitre.org>.
> So, if you are trying to make sure your index breaks words properly on eastern languages, just use ICU Tokenizer.   

I defer to the expertise on this list, but last I checked ICUTokenizer uses dictionary lookup to tokenize CJK.  This may work well for some tasks, but I haven't evaluated whether it performs better than smartcn or even just cjkbigramfilter on actual retrieval tasks, and I'd be hesitant to state "just use" and imply the problem is solved.  

I thought I remembered ICUTokenizer not playing well with the CJKBigramFilter, but it appears to be working in 6.6.
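
For anyone who wants to try that combination, the chain in question looks
roughly like this (a sketch only, with the script flags spelled out
explicitly; evaluate on real queries before adopting it):

```xml
<!-- ICUTokenizer followed by CJK bigramming; whether this beats smartcn
     on retrieval is the open question discussed above -->
<analyzer>
  <tokenizer class="solr.ICUTokenizerFactory"/>
  <filter class="solr.CJKBigramFilterFactory"
          han="true" hiragana="true" katakana="true" hangul="true"/>
</analyzer>
```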

> use the ICUNormalizer
I could not agree with this more.  

RE: How are people using the ICUTokenizer?

Posted by "Davis, Daniel (NIH/NLM) [C]" <da...@nih.gov>.
The GUI is not built yet, so the jury is out.  I plan to include switches to do the MoreLikeThis both ways, but I think the shingled version will do a better job, because this approach follows a specific classification case study in the book Taming Text by Grant Ingersoll.  It is a reasonable assumption that he knows more than I do.

Re: How are people using the ICUTokenizer?

Posted by David Hastings <ha...@gmail.com>.
Have you successfully used shingles with the MoreLikeThis query? I'm
really curious whether this would return the "interesting phrases".

RE: How are people using the ICUTokenizer?

Posted by "Davis, Daniel (NIH/NLM) [C]" <da...@nih.gov>.
Joel,

I think the issue is doing word-breaking according to ICU rules.   So, if you are trying to make sure your index breaks words properly on eastern languages, just use ICU Tokenizer.   Unless your text is already in an ICU normal form, you should always use the ICUNormalizer character filter along with this:

https://cwiki.apache.org/confluence/display/solr/CharFilterFactories#CharFilterFactories-solr.ICUNormalizer2CharFilterFactory

I think that this would be good with Shingles when you are not removing stop words, maybe in an alternate analysis of the same content.

I'm using it in this way, with shingles for phrase recognition and only doc freq and term freq - my possibly naïve idea is that I do not need positions and offsets if I'm using shingles, and my main goal is to do a MoreLikeThis query using the shingled versions of fields.
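
As a sketch of that setup (the field-type name and shingle parameters here
are illustrative, not copied from my actual schema):

```xml
<!-- Illustrative shingled field: ICU normalization + tokenization, then
     2-3 word shingles; positions are omitted since the shingles themselves
     carry the word-order information -->
<fieldType name="text_shingle" class="solr.TextField" omitPositions="true">
  <analyzer>
    <charFilter class="solr.ICUNormalizer2CharFilterFactory"/>
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.ShingleFilterFactory"
            minShingleSize="2" maxShingleSize="3" outputUnigrams="true"/>
  </analyzer>
</fieldType>
```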
