Posted to dev@lucene.apache.org by Tomás Fernández Löbbe <to...@gmail.com> on 2016/04/04 21:39:40 UTC

Japanese Tokenizer using User Dictionary

If I understand correctly, the user dictionary in the JapaneseTokenizer
allows users to customize how a stream is broken into tokens using a
specific set of rules provided like:
AABBBCC -> AA BBB CC
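For reference, each such rule is one line of the user-dictionary CSV: surface form,
space-separated segmentation, space-separated readings, part of speech. Below is a
minimal sketch of loading a dictionary and handing it to the tokenizer; the file
name, the sample entry and the SEARCH mode are placeholder choices, not anything
the API requires:

// userdict_ja.txt, one rule per line, e.g. the classic Kuromoji example:
//   関西国際空港,関西 国際 空港,カンサイ コクサイ クウコウ,カスタム名詞
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.lucene.analysis.ja.JapaneseTokenizer;
import org.apache.lucene.analysis.ja.dict.UserDictionary;

public class UserDictExample {
  public static void main(String[] args) throws Exception {
    try (Reader reader = new InputStreamReader(
        Files.newInputStream(Paths.get("userdict_ja.txt")), StandardCharsets.UTF_8)) {
      // Parse the CSV rules into a UserDictionary...
      UserDictionary userDict = UserDictionary.open(reader);
      // ...and hand it to the tokenizer; segmentation then follows the rules above.
      JapaneseTokenizer tokenizer =
          new JapaneseTokenizer(userDict, true, JapaneseTokenizer.Mode.SEARCH);
      // Use it like any other Tokenizer: setReader(...), reset(), incrementToken(), close().
      tokenizer.close();
    }
  }
}

(In Solr the same thing is normally wired up through JapaneseTokenizerFactory's
userDictionary attribute.)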

It does not allow users to change any of the characters like:

AABBBCC -> DD BBB CC   (this will just tokenize to "AA", "BBB", "CC", seems
to only care about positions)

It also doesn't let a character be part of more than one token, like:

AABBBCC -> AAB BBB BCC (this will throw an AIOOBE)

..or make the output token bigger than the input text:

AA -> AAA (Also AIOOBE)

Is this the expected behavior? Maybe cases 2-4 should be handled by adding
filters instead. If so, are there any cases where the user dictionary should
accept a tokenization where the original text is different from the
concatenation of the tokens?
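To make the "adding filters" option concrete for case 2: rewriting characters can be
done in front of the tokenizer with a MappingCharFilter rather than in the user
dictionary. A rough sketch only; the AA -> DD mapping just mirrors the example above:

import java.io.Reader;
import java.io.StringReader;
import org.apache.lucene.analysis.charfilter.MappingCharFilter;
import org.apache.lucene.analysis.charfilter.NormalizeCharMap;
import org.apache.lucene.analysis.ja.JapaneseTokenizer;

public class CharFilterExample {
  public static void main(String[] args) throws Exception {
    // Rewrite "AA" to "DD" before the tokenizer ever sees the text.
    NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
    builder.add("AA", "DD");
    NormalizeCharMap map = builder.build();

    Reader filtered = new MappingCharFilter(map, new StringReader("AABBBCC"));

    // No user dictionary here (null); the char filter alone changes the characters,
    // and a user dictionary would then only control how "DDBBBCC" is segmented.
    JapaneseTokenizer tokenizer =
        new JapaneseTokenizer(null, true, JapaneseTokenizer.Mode.SEARCH);
    tokenizer.setReader(filtered);
    // reset(), incrementToken(), end(), close() as with any TokenStream.
  }
}

Cases 3 and 4 would instead need a token-level filter (something synonym-style that
can emit overlapping or longer tokens), since a char filter only rewrites the input.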

Tomás

Re: Japanese Tokenizer using User Dictionary

Posted by Tomás Fernández Löbbe <to...@gmail.com>.
Thanks Christian,
I created https://issues.apache.org/jira/browse/LUCENE-7181

On Mon, Apr 4, 2016 at 11:38 PM, Christian Moen <cm...@atilika.com> wrote:

> Hello again Tomás,
>
> Thanks.  I agree entirely.  If you open a JIRA, I'll have a look and
> make improvements.
>
> Best regards,
>
> Christian Moen
> アティリカ株式会社
> https://www.atilika.com
>
> On Apr 5, 2016, at 15:12, Tomás Fernández Löbbe <to...@gmail.com>
> wrote:
>
> Thanks Christian,
> I don't have a different use case, but if what I said is the expected
> behavior, I think we should validate the User Dictionary at create time
> (and allow only proper tokenization) instead of breaking when using the
> tokenizer.
> If you agree I'll create a Jira for that.
>
> Thanks,
>
> Tomás
>
> On Mon, Apr 4, 2016 at 10:05 PM, Christian Moen <cm...@atilika.com> wrote:
>
>> Hello Tomás,
>>
>> What you are describing is the expected behaviour.  If you have any
>> specific use cases that motivate how this perhaps should be changed, I'm
>> very happy to learn more about them to see how we can improve things.
>>
>> Many thanks,
>>
>> Christian Moen
>> アティリカ株式会社
>> https://www.atilika.com
>>
>> > On Apr 5, 2016, at 04:39, Tomás Fernández Löbbe <to...@gmail.com>
>> wrote:
>> >
>> > If I understand correctly, the user dictionary in the JapaneseTokenizer
>> allows users to customize how a stream is broken into tokens using a
>> specific set of rules provided like:
>> > AABBBCC -> AA BBB CC
>> >
>> > It does not allow users to change any of the characters like:
>> >
>> > AABBBCC -> DD BBB CC   (this will just tokenize to "AA", "BBB", "CC",
>> seems to only care about positions)
>> >
>> > It also doesn't let a character be part of more than one token, like:
>> >
>> > AABBBCC -> AAB BBB BCC (this will throw an AIOOBE)
>> >
>> > ..or make the output token bigger than the input text:
>> >
>> > AA -> AAA (Also AIOOBE)
>> >
>> > Is this the expected behavior? Maybe cases 2-4 should be handled by
>> adding filters instead. If so, are there any cases where the user dictionary
>> should accept a tokenization where the original text is different from the
>> concatenation of the tokens?
>> >
>> > Tomás
>> >
>>
>>
>>
>>
>
>

Re: Japanese Tokenizer using User Dictionary

Posted by Christian Moen <cm...@atilika.com>.
Hello again Tomás,

Thanks.  I agree entirely.  If you open a JIRA, I'll have a look and make improvements.

Best regards,

Christian Moen
アティリカ株式会社
https://www.atilika.com

> On Apr 5, 2016, at 15:12, Tomás Fernández Löbbe <to...@gmail.com> wrote:
> 
> Thanks Christian, 
> I don't have a different use case, but if what I said is the expected behavior, I think we should validate the User Dictionary at create time (and allow only proper tokenization) instead of breaking when using the tokenizer.
> If you agree I'll create a Jira for that.
> 
> Thanks, 
> 
> Tomás
> 
> On Mon, Apr 4, 2016 at 10:05 PM, Christian Moen <cm@atilika.com> wrote:
> Hello Tomás,
> 
> What you are describing is the expected behaviour.  If you have any specific use cases that motivate how this perhaps should be changed, I'm very happy to learn more about them to see how we can improve things.
> 
> Many thanks,
> 
> Christian Moen
> アティリカ株式会社
> https://www.atilika.com
> 
> > On Apr 5, 2016, at 04:39, Tomás Fernández Löbbe <tomasflobbe@gmail.com> wrote:
> >
> > If I understand correctly, the user dictionary in the JapaneseTokenizer allows users to customize how a stream is broken into tokens using a specific set of rules provided like:
> > AABBBCC -> AA BBB CC
> >
> > It does not allow users to change any of the characters like:
> >
> > AABBBCC -> DD BBB CC   (this will just tokenize to "AA", "BBB", "CC", seems to only care about positions)
> >
> > It also doesn't let a character be part of more than one token, like:
> >
> > AABBBCC -> AAB BBB BCC (this will throw an AIOOBE)
> >
> > ..or make the output token bigger than the input text:
> >
> > AA -> AAA (Also AIOOBE)
> >
> > Is this the expected behavior? Maybe cases 2-4 should be handled by adding filters instead. If so, are there any cases where the user dictionary should accept a tokenization where the original text is different from the concatenation of the tokens?
> >
> > Tomás
> >
> 
> 
> 
> 


Re: Japanese Tokenizer using User Dictionary

Posted by Tomás Fernández Löbbe <to...@gmail.com>.
Thanks Christian,
I don't have a different use case, but if what I said is the expected
behavior, I think we should validate the User Dictionary at create time
(and allow only proper tokenization) instead of breaking when using the
tokenizer.
If you agree I'll create a Jira for that.
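For what it's worth, the invariant such a check would enforce is simply that an
entry's segments concatenate back to its surface form. A rough sketch of that check
(illustrative only, not an actual patch; the field layout follows the user-dictionary
CSV and the sample part-of-speech values are arbitrary):

public class UserDictValidation {
  // Reject a rule whose segmentation does not concatenate back to the surface form.
  static void validateEntry(String line) {
    String[] fields = line.split(",");
    String surface = fields[0];
    String[] segments = fields[1].split("\\s+");

    StringBuilder joined = new StringBuilder();
    for (String segment : segments) {
      joined.append(segment);
    }
    // The problematic cases from the original mail all break this invariant: the
    // segmentation must be a plain split of the surface form, with nothing added,
    // changed, or overlapping.
    if (!joined.toString().equals(surface)) {
      throw new IllegalArgumentException(
          "Illegal user dictionary entry: segmentation \"" + fields[1]
              + "\" does not concatenate back to \"" + surface + "\"");
    }
  }

  public static void main(String[] args) {
    validateEntry("AABBBCC,AA BBB CC,aa bbb cc,noun");   // fine (case 1)
    validateEntry("AABBBCC,AAB BBB BCC,aa bbb cc,noun"); // rejected up front (case 3)
  }
}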

Thanks,

Tomás

On Mon, Apr 4, 2016 at 10:05 PM, Christian Moen <cm...@atilika.com> wrote:

> Hello Tomás,
>
> What you are describing is the expected behaviour.  If you have any
> specific use cases that motivate how this perhaps should be changed, I'm
> very happy to learn more about them to see how we can improve things.
>
> Many thanks,
>
> Christian Moen
> アティリカ株式会社
> https://www.atilika.com
>
> > On Apr 5, 2016, at 04:39, Tomás Fernández Löbbe <to...@gmail.com>
> wrote:
> >
> > If I understand correctly, the user dictionary in the JapaneseTokenizer
> allows users to customize how a stream is broken into tokens using a
> specific set of rules provided like:
> > AABBBCC -> AA BBB CC
> >
> > It does not allow users to change any of the characters like:
> >
> > AABBBCC -> DD BBB CC   (this will just tokenize to "AA", "BBB", "CC",
> seems to only care about positions)
> >
> > It also doesn't let a character be part of more than one token, like:
> >
> > AABBBCC -> AAB BBB BCC (this will throw an AIOOBE)
> >
> > ..or make the output token bigger than the input text:
> >
> > AA -> AAA (Also AIOOBE)
> >
> > Is this the expected behavior? Maybe cases 2-4 should be handled by
> adding filters instead. If so, are there any cases where the user dictionary
> should accept a tokenization where the original text is different from the
> concatenation of the tokens?
> >
> > Tomás
> >
>
>
>
>

Re: Japanese Tokenizer using User Dictionary

Posted by Christian Moen <cm...@atilika.com>.
Hello Tomás,

What you are describing is the expected behaviour.  If you have any specific use cases that motivate how this perhaps should be changed, I'm very happy to learn more about them to see how we can improve things.

Many thanks,

Christian Moen
アティリカ株式会社
https://www.atilika.com

> On Apr 5, 2016, at 04:39, Tomás Fernández Löbbe <to...@gmail.com> wrote:
> 
> If I understand correctly, the user dictionary in the JapaneseTokenizer allows users to customize how a stream is broken into tokens using a specific set of rules provided like: 
> AABBBCC -> AA BBB CC
> 
> It does not allow users to change any of the characters like:
> 
> AABBBCC -> DD BBB CC   (this will just tokenize to "AA", "BBB", "CC", seems to only care about positions)
> 
> It also doesn't let a character be part of more than one token, like:
> 
> AABBBCC -> AAB BBB BCC (this will throw an AIOOBE)
> 
> ..or make the output token bigger than the input text: 
> 
> AA -> AAA (Also AIOOBE)
> 
> Is this the expected behavior? Maybe cases 2-4 should be handled by adding filters instead. If so, are there any cases where the user dictionary should accept a tokenization where the original text is different from the concatenation of the tokens?
> 
> Tomás
> 

