Posted to issues@lucene.apache.org by "Nguyen Minh Gia Huy (Jira)" <ji...@apache.org> on 2020/11/05 07:25:00 UTC
[jira] [Comment Edited] (LUCENE-9588) Exceptions handling in methods of SegmentingTokenizerBase

    [ https://issues.apache.org/jira/browse/LUCENE-9588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17226541#comment-17226541 ] 

Nguyen Minh Gia Huy edited comment on LUCENE-9588 at 11/5/20, 7:24 AM:
-----------------------------------------------------------------------

Sorry, I didn't explain the example with JapaneseTokenizer very well.

Let's say I want a Tokenizer that breaks the text into sentences and sends each sentence to another tokenizer, for example JapaneseTokenizer, for segmentation (so that the JapaneseTokenizer analyzes each sentence instead of the whole paragraph). To do so, I would have to make that tokenizer extend SegmentingTokenizerBase and invoke the JapaneseTokenizer in the *incrementWord* method. JapaneseTokenizer extends Tokenizer, so its *incrementToken* can throw IOException during analysis.

For this specific use case, the *incrementToken* of JapaneseTokenizer (and of any other class that extends Tokenizer) combines I/O and segmentation, and there seems to be no way to separate them.

I agree with the idea that SegmentingTokenizerBase handles the I/O itself and that the subclass deals only with word segmentation. However, word segmentation probably shouldn't be limited to non-I/O logic. The existing subclasses of SegmentingTokenizerBase don't have this issue because they don't do word segmentation in Tokenizer style, for example [WordSegmenter|https://github.com/apache/lucene-solr/blob/master/lucene/analysis/smartcn/src/java/org/apache/lucene/analysis/cn/smart/HMMChineseTokenizer.java#L46] in HMMChineseTokenizer. I think allowing IOException in *setNextSentence* and *incrementWord* would give users more flexibility in their choice of word segmentation and thus improve the usability of this class.
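
To make the use case concrete, here is a minimal, self-contained sketch of the workaround that is needed today. It does not use Lucene itself: InnerTokenizer, SentenceBase, WhitespaceInner, DelegatingSentenceTokenizer and SentenceDemo are hypothetical stand-ins, and the hook signatures are simplified (the sentence is passed as a String purely for illustration). The only point is that a hook which declares no checked exception has to catch the IOException coming from the inner tokenizer and rethrow it unchecked.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for a Tokenizer: incrementToken() declares IOException.
interface InnerTokenizer {
  void setSentence(String sentence);
  boolean incrementToken() throws IOException;
  String term();
}

// Hypothetical, simplified stand-in for SegmentingTokenizerBase:
// the two hooks declare no checked exceptions.
abstract class SentenceBase {
  private final Iterator<String> sentences;
  private boolean inSentence;

  SentenceBase(String text) {
    // Naive sentence split on '.', for illustration only.
    this.sentences = Arrays.asList(text.split("\\.")).iterator();
  }

  public final boolean incrementToken() throws IOException {
    while (true) {
      if (inSentence && incrementWord()) {
        return true;
      }
      if (!sentences.hasNext()) {
        return false;
      }
      setNextSentence(sentences.next());
      inSentence = true;
    }
  }

  protected abstract void setNextSentence(String sentence);

  protected abstract boolean incrementWord();
}

// Trivial inner tokenizer that splits a sentence on whitespace.
class WhitespaceInner implements InnerTokenizer {
  private Iterator<String> words = Collections.emptyIterator();
  private String term;

  public void setSentence(String sentence) {
    words = Arrays.asList(sentence.trim().split("\\s+")).iterator();
  }

  public boolean incrementToken() throws IOException {
    if (!words.hasNext()) {
      return false;
    }
    term = words.next();
    return true;
  }

  public String term() {
    return term;
  }
}

// Sends each sentence to the inner tokenizer for word segmentation.
class DelegatingSentenceTokenizer extends SentenceBase {
  private final InnerTokenizer inner;
  String current;

  DelegatingSentenceTokenizer(String text, InnerTokenizer inner) {
    super(text);
    this.inner = inner;
  }

  @Override
  protected void setNextSentence(String sentence) {
    inner.setSentence(sentence);
  }

  @Override
  protected boolean incrementWord() {
    try {
      if (inner.incrementToken()) {
        current = inner.term();
        return true;
      }
      return false;
    } catch (IOException e) {
      // The hook cannot declare IOException, so it must be wrapped.
      throw new UncheckedIOException(e);
    }
  }
}

class SentenceDemo {
  static List<String> tokenize(String text) throws IOException {
    DelegatingSentenceTokenizer t =
        new DelegatingSentenceTokenizer(text, new WhitespaceInner());
    List<String> out = new ArrayList<>();
    while (t.incrementToken()) {
      out.add(t.current);
    }
    return out;
  }
}
```

If the two hooks were declared with throws IOException, the try/catch in incrementWord would go away and the exception would simply propagate through incrementToken, which already declares it.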


> Exceptions handling in methods of SegmentingTokenizerBase
> ---------------------------------------------------------
>
>                 Key: LUCENE-9588
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9588
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>    Affects Versions: 8.6.3
>            Reporter: Nguyen Minh Gia Huy
>            Priority: Minor
>
> The current signatures of the *setNextSentence* and *incrementWord* methods in *SegmentingTokenizerBase* do not declare any checked exceptions, which makes the class troublesome to extend.
> For example, if we override incrementWord with logic that invokes incrementToken on another tokenizer, incrementToken can throw IOException, but incrementWord is not declared to throw it.
> I think having setNextSentence and incrementWord declare IOException would make SegmentingTokenizerBase easier to use.


