
Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2021/04/14 21:02:08 UTC

[GitHub] [lucene] rmuir commented on a change in pull request #84: LUCENE-9929 Make ScandinavianNormalizationFilter configurable wrt fol…

rmuir commented on a change in pull request #84:
URL: https://github.com/apache/lucene/pull/84#discussion_r613583132



##########
File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/ScandinavianNormalizationFilter.java
##########
@@ -33,14 +34,45 @@
  * <p>blåbærsyltetøj == blåbärsyltetöj == blaabaarsyltetoej but not blabarsyltetoj räksmörgås ==
  * ræksmørgås == ræksmörgaos == raeksmoergaas but not raksmorgas
  *
+ * <p>You can choose which of the foldings to apply (aa, ao, ae, oe, oo) through a parameter.
+ *
  * @see ScandinavianFoldingFilter
  */
 public final class ScandinavianNormalizationFilter extends TokenFilter {
 
+  /**
+   * Create the filter with default folding rules, backward compatible with all earlier versions
+   *
+   * @param input the TokenStream
+   */
   public ScandinavianNormalizationFilter(TokenStream input) {
     super(input);
+    this.foldings = ALL_FOLDINGS;
   }
 
+  /**
+   * Create the filter using custom folding rules.
+   *
+   * @param input the TokenStream
+   * @param foldings a Set of Foldings to apply (i.e. AE, OE, AA, AO, OO)
+   */
+  public ScandinavianNormalizationFilter(TokenStream input, Set<Foldings> foldings) {

Review comment:
       I still don't like this API for the end user. End users may not know which of these foldings are appropriate for each language. Please see what I stated on the JIRA issue. Exposing Norwegian/Swedish/Danish filters doesn't break any API. You also don't have to remove the existing Scandinavian one that does all foldings. Nor do you have to duplicate huge chunks of code!
   
   Personally, I would move the logic into a `ScandinavianNormalizer(Set<Foldings>)` helper that gets used by:
   * existing ScandinavianNormalizationFilter, it just creates `new ScandinavianNormalizer(ALL)` and uses it
   * NorwegianNormalizationFilter, creates `new ScandinavianNormalizer(???)` and uses it
   * SwedishNormalizationFilter, creates `new ScandinavianNormalizer(???)` and uses it
   * DanishNormalizationFilter, creates `new ScandinavianNormalizer(???)` and uses it
   
   This way, all 4 filters and their factories are parameter-free. Nobody needs to know anything about how these languages work in order to do the "right" thing, e.g. if they have some Norwegian text, they just use the Norwegian one, even if they don't have a clue about Norwegian orthography.
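
A rough sketch of that layout, to make the shape concrete. It is not tested against Lucene: plain strings stand in for the TokenStream and term buffer, the class names ending in `Sketch` are hypothetical, the case handling is simplified, and the folding set in `SingleLanguageSketch` is an arbitrary placeholder, not a claim about which foldings suit any particular language (that is still the `???` above).

```java
import java.util.EnumSet;
import java.util.Set;

// Stand-in for the Foldings enum named in the diff's javadoc (AE, OE, AA, AO, OO).
enum Foldings { AA, AO, AE, OE, OO }

// Hypothetical shared helper: the folding logic lives here exactly once.
final class ScandinavianNormalizer {
  static final Set<Foldings> ALL_FOLDINGS = EnumSet.allOf(Foldings.class);

  private static final char A_RING = '\u00E5';  // a with ring above
  private static final char AE_LIG = '\u00E6';  // ae ligature
  private static final char O_SLASH = '\u00F8'; // o with stroke
  private static final char A_UML = '\u00E4';   // a with diaeresis
  private static final char O_UML = '\u00F6';   // o with diaeresis

  private final Set<Foldings> foldings;

  ScandinavianNormalizer(Set<Foldings> foldings) {
    EnumSet<Foldings> copy = EnumSet.noneOf(Foldings.class);
    copy.addAll(foldings); // defensive copy, also accepts an empty set
    this.foldings = copy;
  }

  // Simplified normalization over a String instead of a term buffer.
  String normalize(String term) {
    StringBuilder out = new StringBuilder(term.length());
    for (int i = 0; i < term.length(); i++) {
      char c = term.charAt(i);
      char lower = Character.toLowerCase(c);
      char next = i + 1 < term.length() ? Character.toLowerCase(term.charAt(i + 1)) : '\0';
      char folded = fold(lower, next);
      if (folded != 0) {
        out.append(matchCase(c, folded)); // two letters collapse into one
        i++;                              // skip the second letter of the pair
      } else if (lower == A_UML) {
        out.append(matchCase(c, AE_LIG)); // interchangeable single letters
      } else if (lower == O_UML) {
        out.append(matchCase(c, O_SLASH));
      } else {
        out.append(c);
      }
    }
    return out.toString();
  }

  // Returns the replacement letter for an enabled two-letter variant, or 0 if none applies.
  private char fold(char first, char second) {
    if (first == 'a' && second == 'a' && foldings.contains(Foldings.AA)) return A_RING;
    if (first == 'a' && second == 'o' && foldings.contains(Foldings.AO)) return A_RING;
    if (first == 'a' && second == 'e' && foldings.contains(Foldings.AE)) return AE_LIG;
    if (first == 'o' && second == 'e' && foldings.contains(Foldings.OE)) return O_SLASH;
    if (first == 'o' && second == 'o' && foldings.contains(Foldings.OO)) return O_SLASH;
    return 0;
  }

  private static char matchCase(char original, char replacement) {
    return Character.isUpperCase(original) ? Character.toUpperCase(replacement) : replacement;
  }
}

// Parameter-free wrapper that applies every folding, like the existing filter would.
final class ScandinavianNormalizationSketch {
  private final ScandinavianNormalizer normalizer =
      new ScandinavianNormalizer(ScandinavianNormalizer.ALL_FOLDINGS);

  String normalize(String term) { return normalizer.normalize(term); }
}

// Parameter-free language-specific wrapper. PLACEHOLDER is arbitrary, chosen only
// to show the wiring; the real per-language sets are left open in the comment above.
final class SingleLanguageSketch {
  private static final Set<Foldings> PLACEHOLDER = EnumSet.of(Foldings.AA);
  private final ScandinavianNormalizer normalizer = new ScandinavianNormalizer(PLACEHOLDER);

  String normalize(String term) { return normalizer.normalize(term); }
}
```

The point of the wrappers is that a caller only picks a class and never sees a `Set<Foldings>`; the set stays an implementation detail of each filter.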



