Posted to issues@lucene.apache.org by "Nguyen Minh Gia Huy (Jira)" <ji...@apache.org> on 2020/11/05 06:43:00 UTC
[jira] [Issue Comment Deleted] (LUCENE-9588) Exceptions handling in methods of SegmentingTokenizerBase
[ https://issues.apache.org/jira/browse/LUCENE-9588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Nguyen Minh Gia Huy updated LUCENE-9588:
----------------------------------------
Comment: was deleted
(was: I wonder what the appropriate usage of this class is.
Let's say I want a Tokenizer that breaks the text into sentences and sends each sentence to another tokenizer, for example WhitespaceTokenizer, for segmentation. To do so, I would have to make that tokenizer extend SegmentingTokenizerBase and invoke the WhitespaceTokenizer in the *incrementWord* method. WhitespaceTokenizer extends Tokenizer, so its incrementToken can throw an IOException during analysis.
How could the I/O and the segmentation be separated in such cases? Is SegmentingTokenizerBase intended only for non-I/O segmentation? For example, [HMMChineseTokenizer|https://github.com/apache/lucene-solr/blob/master/lucene/analysis/smartcn/src/java/org/apache/lucene/analysis/cn/smart/HMMChineseTokenizer.java#L46] splits each sentence with a WordSegmenter, which doesn't require I/O handling.)
> Exceptions handling in methods of SegmentingTokenizerBase
> ---------------------------------------------------------
>
> Key: LUCENE-9588
> URL: https://issues.apache.org/jira/browse/LUCENE-9588
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/analysis
> Affects Versions: 8.6.3
> Reporter: Nguyen Minh Gia Huy
> Priority: Minor
>
> The *setNextSentence* and *incrementWord* methods of *SegmentingTokenizerBase* do not declare any checked exceptions, which makes the class troublesome to extend.
> For example, if we override incrementWord with logic that invokes incrementToken on another tokenizer, incrementToken can throw an IOException, but incrementWord is not declared to throw it.
> I think declaring IOException on setNextSentence and incrementWord would make SegmentingTokenizerBase easier to use.
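
One possible workaround under the current signatures is to wrap the checked exception inside the override. The following is only a minimal sketch of that pattern and does not use Lucene itself: IoTokenSource, SegmenterSketch and DelegatingSegmenter are hypothetical stand-ins for a delegate Tokenizer, for SegmentingTokenizerBase, and for a user subclass. The entry point is modeled as final on the assumption that a subclass cannot add handling there.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

public class UncheckedWrapSketch {

    /** Hypothetical stand-in for a delegate whose incrementToken() declares IOException. */
    interface IoTokenSource {
        boolean incrementToken() throws IOException;
    }

    /** Hypothetical stand-in for the base class: the hook declares no checked exceptions. */
    abstract static class SegmenterSketch {
        protected abstract boolean incrementWord();

        // Modeled as final (an assumption of this sketch), so subclasses cannot intercept here.
        public final boolean incrementToken() throws IOException {
            return incrementWord();
        }
    }

    /** Hypothetical subclass that forwards to a delegate and wraps its IOException. */
    static final class DelegatingSegmenter extends SegmenterSketch {
        private final IoTokenSource delegate;

        DelegatingSegmenter(IoTokenSource delegate) {
            this.delegate = delegate;
        }

        @Override
        protected boolean incrementWord() {
            try {
                return delegate.incrementToken();
            } catch (IOException e) {
                // incrementWord() cannot declare IOException, so rethrow it unchecked.
                throw new UncheckedIOException(e);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        SegmenterSketch ok = new DelegatingSegmenter(() -> true);
        System.out.println(ok.incrementToken());

        SegmenterSketch failing = new DelegatingSegmenter(() -> {
            throw new IOException("read failed");
        });
        try {
            failing.incrementToken();
        } catch (UncheckedIOException e) {
            // The original IOException is still available as the cause.
            System.out.println(e.getCause().getMessage());
        }
    }
}
```

The drawback of this pattern is that callers see an UncheckedIOException rather than the IOException they expect from incrementToken, which is part of why declaring IOException on the two hooks would be cleaner.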