You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2021/04/14 15:51:17 UTC

[GitHub] [lucene] janhoy opened a new pull request #84: LUCENE-9929 Make ScandinavianNormalizationFilter configurable wrt fol…

janhoy opened a new pull request #84:
URL: https://github.com/apache/lucene/pull/84


   See https://issues.apache.org/jira/browse/LUCENE-9929


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] janhoy commented on pull request #84: LUCENE-9929 Make ScandinavianNormalizationFilter configurable wrt fol…

Posted by GitBox <gi...@apache.org>.
janhoy commented on pull request #84:
URL: https://github.com/apache/lucene/pull/84#issuecomment-820993014


   Refactor done, new NorwegianNormalizationFilter added in the 'no' package. Danish and Swedish can be handled in followup issues.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] janhoy commented on pull request #84: LUCENE-9929 NorwegianNormalizationFilter

Posted by GitBox <gi...@apache.org>.
janhoy commented on pull request #84:
URL: https://github.com/apache/lucene/pull/84#issuecomment-823162649


   Ready for a new review. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] rmuir commented on a change in pull request #84: LUCENE-9929 Make ScandinavianNormalizationFilter configurable wrt fol…

Posted by GitBox <gi...@apache.org>.
rmuir commented on a change in pull request #84:
URL: https://github.com/apache/lucene/pull/84#discussion_r613628276



##########
File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/ScandinavianNormalizationFilter.java
##########
@@ -33,14 +34,45 @@
  * <p>blåbærsyltetøj == blåbärsyltetöj == blaabaarsyltetoej but not blabarsyltetoj räksmörgås ==
  * ræksmørgås == ræksmörgaos == raeksmoergaas but not raksmorgas
  *
+ * <p>You can choose which of the foldings to apply (aa, ao, ae, oe, oo) through a parameter.
+ *
  * @see ScandinavianFoldingFilter
  */
 public final class ScandinavianNormalizationFilter extends TokenFilter {
 
+  /**
+   * Create the filter with default folding rules, backward compatible with all earlier versions
+   *
+   * @param input the TokenStream
+   */
   public ScandinavianNormalizationFilter(TokenStream input) {
     super(input);
+    this.foldings = ALL_FOLDINGS;
   }
 
+  /**
+   * Create the filter using custom folding rules.
+   *
+   * @param input the TokenStream
+   * @param foldings a Set of Foldings to apply (i.e. AE, OE, AA, AO, OO)
+   */
+  public ScandinavianNormalizationFilter(TokenStream input, Set<Foldings> foldings) {

Review comment:
       The thin-wrappers is just the existing design/organization of lucene/analysis. The idea is that a user has to index a language, they look for their language and use the available tools. When there are a lot of languages (and the user may not be subject expert on each one), it is a good simplification from an API perspective: the user wants to index Danish so they look at the tools in the 'da' package. So please let's avoid any special language parameter.
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] janhoy commented on pull request #84: LUCENE-9929 NorwegianNormalizationFilter

Posted by GitBox <gi...@apache.org>.
janhoy commented on pull request #84:
URL: https://github.com/apache/lucene/pull/84#issuecomment-839641945


   See latest commit which addresses review comments @jpountz 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] janhoy commented on a change in pull request #84: LUCENE-9929 Make ScandinavianNormalizationFilter configurable wrt fol…

Posted by GitBox <gi...@apache.org>.
janhoy commented on a change in pull request #84:
URL: https://github.com/apache/lucene/pull/84#discussion_r613595880



##########
File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/ScandinavianNormalizationFilter.java
##########
@@ -33,14 +34,45 @@
  * <p>blåbærsyltetøj == blåbärsyltetöj == blaabaarsyltetoej but not blabarsyltetoj räksmörgås ==
  * ræksmørgås == ræksmörgaos == raeksmoergaas but not raksmorgas
  *
+ * <p>You can choose which of the foldings to apply (aa, ao, ae, oe, oo) through a parameter.
+ *
  * @see ScandinavianFoldingFilter
  */
 public final class ScandinavianNormalizationFilter extends TokenFilter {
 
+  /**
+   * Create the filter with default folding rules, backward compatible with all earlier versions
+   *
+   * @param input the TokenStream
+   */
   public ScandinavianNormalizationFilter(TokenStream input) {
     super(input);
+    this.foldings = ALL_FOLDINGS;
   }
 
+  /**
+   * Create the filter using custom folding rules.
+   *
+   * @param input the TokenStream
+   * @param foldings a Set of Foldings to apply (i.e. AE, OE, AA, AO, OO)
+   */
+  public ScandinavianNormalizationFilter(TokenStream input, Set<Foldings> foldings) {

Review comment:
       We can obtain a similar Lucene API usability by adding helper vars:
   ```java
   public static final Set<Foldings> ALL_FOLDINGS = Set.of(AA, AO, OO, AE, OE);
   public static final Set<Foldings> NORWEGIAN_FOLDINGS = Set.of(AE, OE, AA);
   public static final Set<Foldings> DANISH_FOLDINGS = NORWEGIAN_FOLDINGS;
   public static final Set<Foldings> SWEDISH_FOLDINGS = ALL_FOLDINGS;
   ```
   In the factory that would translate to perhaps a "language" parameter with predefined settings.
   
   I'm not opposed to thin wrapper filters for each language, but I'd like some feedback from other Scandinavian users on what those should default to.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jpountz commented on a change in pull request #84: LUCENE-9929 NorwegianNormalizationFilter

Posted by GitBox <gi...@apache.org>.
jpountz commented on a change in pull request #84:
URL: https://github.com/apache/lucene/pull/84#discussion_r615920337



##########
File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/ScandinavianNormalizer.java
##########
@@ -0,0 +1,134 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.analysis.miscellaneous;
+
+import java.util.Set;
+import org.apache.lucene.analysis.util.StemmerUtil;
+
+/**
+ * This Normalizer does the heavy lifting for a set of Scandinavian normalization filters,
+ * normalizing use of the interchangeable Scandinavian characters æÆäÄöÖøØ and folded variants (aa,
+ * ao, ae, oe and oo) by transforming them to åÅæÆøØ.
+ *
+ * @since 9.0

Review comment:
       should we make it `lucene.internal`, my understanding is that it's really only used as a way to share code between the various scandinavian normalization filters, and that we would make it pkg-private if the scandinavian and the norwegian filters shared the same package?

##########
File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/ScandinavianNormalizer.java
##########
@@ -0,0 +1,134 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.analysis.miscellaneous;
+
+import java.util.Set;
+import org.apache.lucene.analysis.util.StemmerUtil;
+
+/**
+ * This Normalizer does the heavy lifting for a set of Scandinavian normalization filters,
+ * normalizing use of the interchangeable Scandinavian characters æÆäÄöÖøØ and folded variants (aa,
+ * ao, ae, oe and oo) by transforming them to åÅæÆøØ.
+ *
+ * @since 9.0
+ */
+public final class ScandinavianNormalizer {
+
+  /**
+   * Create the instance, while choosing which foldings to apply. This may differ between Norwegian,
+   * Danish and Swedish.
+   *
+   * @param foldings a Set of Foldings to apply (i.e. AE, OE, AA, AO, OO)
+   */
+  public ScandinavianNormalizer(Set<Foldings> foldings) {
+    this.foldings = foldings;
+  }
+
+  /** List of possible foldings that can be used when configuring the filter */
+  public enum Foldings {
+    AA,
+    AO,
+    AE,
+    OE,
+    OO
+  }
+
+  private final Set<Foldings> foldings;
+
+  public static final Set<Foldings> ALL_FOLDINGS =
+      Set.of(Foldings.AA, Foldings.AO, Foldings.OO, Foldings.AE, Foldings.OE);

Review comment:
       Use `EnumSet#allOf`?

##########
File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/no/NorwegianNormalizationFilter.java
##########
@@ -0,0 +1,37 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.analysis.no;
+
+import java.util.Set;
+import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.analysis.miscellaneous.ScandinavianNormalizationFilter;
+import org.apache.lucene.analysis.miscellaneous.ScandinavianNormalizer.Foldings;
+
+/**
+ * This filter normalize use of the interchangeable Scandinavian characters æÆäÄöÖøØ and folded
+ * variants (ae, oe, aa) by transforming them to åÅæÆøØ. This is similar to
+ * ScandinavianNormalizationFilter, except for the folding rules customized for Norwegian.
+ *
+ * <p>blåbærsyltetøj == blåbärsyltetöj == blaabaersyltetoej
+ *
+ * @see ScandinavianNormalizationFilter
+ */
+public final class NorwegianNormalizationFilter extends ScandinavianNormalizationFilter {

Review comment:
       Let's favor composition over inheritance and reuse the ScandinavianNormalizer instead of extending ScandinavianNormalizationFilter?

##########
File path: lucene/analysis/common/src/test/org/apache/lucene/analysis/miscellaneous/TestScandinavianNormalizer.java
##########
@@ -0,0 +1,137 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.analysis.miscellaneous;
+
+import java.io.IOException;
+import java.util.Collections;
+import java.util.Set;
+import org.apache.lucene.analysis.*;
+import org.apache.lucene.analysis.miscellaneous.ScandinavianNormalizer.Foldings;
+import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
+import org.junit.Test;
+
+/** Tests low level the normalizer functionality */
+public class TestScandinavianNormalizer extends BaseTokenStreamTestCase {
+  @Test

Review comment:
       nit: most test suites we have don't use `@Test` and rely on naming conventions instead, let's not grow the number of `@Test` annotations with this PR?

##########
File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/no/NorwegianNormalizationFilter.java
##########
@@ -0,0 +1,37 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.analysis.no;
+
+import java.util.Set;
+import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.analysis.miscellaneous.ScandinavianNormalizationFilter;
+import org.apache.lucene.analysis.miscellaneous.ScandinavianNormalizer.Foldings;
+
+/**
+ * This filter normalize use of the interchangeable Scandinavian characters æÆäÄöÖøØ and folded
+ * variants (ae, oe, aa) by transforming them to åÅæÆøØ. This is similar to
+ * ScandinavianNormalizationFilter, except for the folding rules customized for Norwegian.
+ *
+ * <p>blåbærsyltetøj == blåbärsyltetöj == blaabaersyltetoej
+ *
+ * @see ScandinavianNormalizationFilter
+ */
+public final class NorwegianNormalizationFilter extends ScandinavianNormalizationFilter {
+  public NorwegianNormalizationFilter(TokenStream input) {
+    super(input, Set.of(Foldings.AE, Foldings.OE, Foldings.AA));

Review comment:
       Use `EnumSet#of` instead of `Set#of` to make this set internally represented as a BitSet?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] rmuir commented on a change in pull request #84: LUCENE-9929 Make ScandinavianNormalizationFilter configurable wrt fol…

Posted by GitBox <gi...@apache.org>.
rmuir commented on a change in pull request #84:
URL: https://github.com/apache/lucene/pull/84#discussion_r613583132



##########
File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/ScandinavianNormalizationFilter.java
##########
@@ -33,14 +34,45 @@
  * <p>blåbærsyltetøj == blåbärsyltetöj == blaabaarsyltetoej but not blabarsyltetoj räksmörgås ==
  * ræksmørgås == ræksmörgaos == raeksmoergaas but not raksmorgas
  *
+ * <p>You can choose which of the foldings to apply (aa, ao, ae, oe, oo) through a parameter.
+ *
  * @see ScandinavianFoldingFilter
  */
 public final class ScandinavianNormalizationFilter extends TokenFilter {
 
+  /**
+   * Create the filter with default folding rules, backward compatible with all earlier versions
+   *
+   * @param input the TokenStream
+   */
   public ScandinavianNormalizationFilter(TokenStream input) {
     super(input);
+    this.foldings = ALL_FOLDINGS;
   }
 
+  /**
+   * Create the filter using custom folding rules.
+   *
+   * @param input the TokenStream
+   * @param foldings a Set of Foldings to apply (i.e. AE, OE, AA, AO, OO)
+   */
+  public ScandinavianNormalizationFilter(TokenStream input, Set<Foldings> foldings) {

Review comment:
       I still don't like this API to the end user. End user may not know which of these are appropriate for each language. Please, see what I stated on the JIRA issue. It isn't breaking any api to expose Norwegian/Swedish/Danish filters. You also don't have to remove the existing Scandinavian one that does all foldings. Nor do you have to duplicate huge chunks of code!
   
   Personally, I would move logic into `ScandinavianNormalizer(Set<Foldings>)` helper that gets used by:
   * existing ScandinavianNormalizationFilter, it just creates `new ScandinavianNormalizer(ALL)` and uses it
   * NorwegianNormalizationFilter, creates `new ScandinanvianNormalizer(???)` and uses it
   * SwedishNormaliationFilter, creates `new ScandinanvianNormalizer(???)` and uses it
   * DanishNormalizatIonFilter, creates `new ScandinanvianNormalizer(???)` and uses it
   
   This way, all 4 filters and their factories are parameter-free. Nobody needs to know anything about how these languages work in order to do the "right" thing, e.g. if they have some norwegian text, they just use the norwegian one, even if they don't have a clue about norwegian orthography.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] janhoy merged pull request #84: LUCENE-9929 NorwegianNormalizationFilter

Posted by GitBox <gi...@apache.org>.
janhoy merged pull request #84:
URL: https://github.com/apache/lucene/pull/84


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] janhoy commented on a change in pull request #84: LUCENE-9929 Make ScandinavianNormalizationFilter configurable wrt fol…

Posted by GitBox <gi...@apache.org>.
janhoy commented on a change in pull request #84:
URL: https://github.com/apache/lucene/pull/84#discussion_r613994057



##########
File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/ScandinavianNormalizationFilter.java
##########
@@ -33,14 +34,45 @@
  * <p>blåbærsyltetøj == blåbärsyltetöj == blaabaarsyltetoej but not blabarsyltetoj räksmörgås ==
  * ræksmørgås == ræksmörgaos == raeksmoergaas but not raksmorgas
  *
+ * <p>You can choose which of the foldings to apply (aa, ao, ae, oe, oo) through a parameter.
+ *
  * @see ScandinavianFoldingFilter
  */
 public final class ScandinavianNormalizationFilter extends TokenFilter {
 
+  /**
+   * Create the filter with default folding rules, backward compatible with all earlier versions
+   *
+   * @param input the TokenStream
+   */
   public ScandinavianNormalizationFilter(TokenStream input) {
     super(input);
+    this.foldings = ALL_FOLDINGS;
   }
 
+  /**
+   * Create the filter using custom folding rules.
+   *
+   * @param input the TokenStream
+   * @param foldings a Set of Foldings to apply (i.e. AE, OE, AA, AO, OO)
+   */
+  public ScandinavianNormalizationFilter(TokenStream input, Set<Foldings> foldings) {

Review comment:
       I'll attempt a refactor and guess what defaults should be, then try to dig up some users who can QA.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org