Posted to issues@lucene.apache.org by "JerryChin (via GitHub)" <gi...@apache.org> on 2023/05/16 16:16:16 UTC

[GitHub] [lucene] JerryChin opened a new pull request, #12299: GITHUB-12291: Skip blank lines from stopwords list.

JerryChin opened a new pull request, #12299:
URL: https://github.com/apache/lucene/pull/12299

   ### Description
   
   Hi team,
   
   This PR fixes #12291: it skips blank lines when loading stopwords with `WordlistLoader#getWordSet`.
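
To illustrate the behavior described above, here is a minimal standalone sketch. It is not Lucene's implementation: the class `BlankLineSkippingSketch` and the method `loadWords` are hypothetical names, and a plain `java.util.Set` stands in for `CharArraySet` so the example runs without Lucene on the classpath.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical stand-in for the loading loop; not the actual WordlistLoader code.
class BlankLineSkippingSketch {

  // Reads one word per line, trims it, and skips lines that are empty after trimming.
  public static Set<String> loadWords(Reader reader) throws IOException {
    Set<String> result = new LinkedHashSet<>();
    try (BufferedReader br = new BufferedReader(reader)) {
      String word;
      while ((word = br.readLine()) != null) {
        word = word.trim();
        if (word.isEmpty()) {
          continue; // blank line: do not add an empty stopword
        }
        result.add(word);
      }
    }
    return result;
  }

  public static void main(String[] args) throws IOException {
    // Two blank lines (one empty, one containing only spaces) sit between the words.
    Set<String> words = loadWords(new StringReader("the\n\n   \nand\n"));
    System.out.println(words); // prints [the, and]
  }
}
```

Without the emptiness check, the same input would also put the empty string into the set.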
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] uschindler commented on pull request #12299: GITHUB-12291: Skip blank lines from stopwords list.

Posted by "uschindler (via GitHub)" <gi...@apache.org>.
uschindler commented on PR #12299:
URL: https://github.com/apache/lucene/pull/12299#issuecomment-1552064934

   Looks fine. I will merge this, but please add a CHANGES.txt entry in the 9.7 section.
   
   Thanks for taking care of the issue! 👍




[GitHub] [lucene] matthias-mueller commented on a diff in pull request #12299: GITHUB-12291: Skip blank lines from stopwords list.

Posted by "matthias-mueller (via GitHub)" <gi...@apache.org>.
matthias-mueller commented on code in PR #12299:
URL: https://github.com/apache/lucene/pull/12299#discussion_r1196179445


##########
lucene/core/src/java/org/apache/lucene/analysis/WordlistLoader.java:
##########
@@ -53,7 +53,10 @@ public static CharArraySet getWordSet(Reader reader, CharArraySet result) throws
     try (BufferedReader br = getBufferedReader(reader)) {
       String word = null;
       while ((word = br.readLine()) != null) {
-        result.add(word.trim());
+        word = word.trim();
+        // skip blank lines
+        if (word.length() == 0) continue;

Review Comment:
   Would `word.strip()` be a better choice here? https://stackoverflow.com/a/51266583
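
The difference this comment refers to can be shown with a small sketch. It assumes Java 11 or later, where `String.strip()` is available. The class and method names are hypothetical and only illustrate the two JDK methods; this is not code from the pull request.

```java
// Hypothetical helper class comparing String.trim() and String.strip().
class TrimVersusStripSketch {

  // trim() only removes leading/trailing characters with code point <= U+0020.
  public static boolean isEmptyAfterTrim(String line) {
    return line.trim().isEmpty();
  }

  // strip() removes leading/trailing characters for which Character.isWhitespace is true.
  public static boolean isEmptyAfterStrip(String line) {
    return line.strip().isEmpty();
  }

  public static void main(String[] args) {
    String ideographicSpace = "\u3000"; // IDEOGRAPHIC SPACE, common in CJK text
    System.out.println(isEmptyAfterTrim(ideographicSpace));  // prints false
    System.out.println(isEmptyAfterStrip(ideographicSpace)); // prints true
    System.out.println(isEmptyAfterTrim("  \t "));           // prints true
  }
}
```

A line consisting only of U+3000 would therefore still count as a non-blank word under `trim()`.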





[GitHub] [lucene] JerryChin commented on pull request #12299: GITHUB-12291: Skip blank lines from stopwords list.

Posted by "JerryChin (via GitHub)" <gi...@apache.org>.
JerryChin commented on PR #12299:
URL: https://github.com/apache/lucene/pull/12299#issuecomment-1552327703

   Hi @uschindler, which category should I put it under? How about `Improvements`?




[GitHub] [lucene] matthias-mueller commented on a diff in pull request #12299: GITHUB-12291: Skip blank lines from stopwords list.

Posted by "matthias-mueller (via GitHub)" <gi...@apache.org>.
matthias-mueller commented on code in PR #12299:
URL: https://github.com/apache/lucene/pull/12299#discussion_r1196380870


##########
lucene/core/src/java/org/apache/lucene/analysis/WordlistLoader.java:
##########
@@ -53,7 +53,10 @@ public static CharArraySet getWordSet(Reader reader, CharArraySet result) throws
     try (BufferedReader br = getBufferedReader(reader)) {
       String word = null;
       while ((word = br.readLine()) != null) {
-        result.add(word.trim());
+        word = word.trim();
+        // skip blank lines
+        if (word.length() == 0) continue;

Review Comment:
   @uschindler then sorry for the noise - I was triggered by `SmartCHINESEAnalyzer` and the lack of Unicode whitespace support in `String.trim()`.





[GitHub] [lucene] uschindler commented on pull request #12299: GITHUB-12291: Skip blank lines from stopwords list.

Posted by "uschindler (via GitHub)" <gi...@apache.org>.
uschindler commented on PR #12299:
URL: https://github.com/apache/lucene/pull/12299#issuecomment-1553192538

   I will merge this to 9.x when back at home.




[GitHub] [lucene] uschindler commented on a diff in pull request #12299: GITHUB-12291: Skip blank lines from stopwords list.

Posted by "uschindler (via GitHub)" <gi...@apache.org>.
uschindler commented on code in PR #12299:
URL: https://github.com/apache/lucene/pull/12299#discussion_r1196362845


##########
lucene/core/src/java/org/apache/lucene/analysis/WordlistLoader.java:
##########
@@ -53,7 +53,10 @@ public static CharArraySet getWordSet(Reader reader, CharArraySet result) throws
     try (BufferedReader br = getBufferedReader(reader)) {
       String word = null;
       while ((word = br.readLine()) != null) {
-        result.add(word.trim());
+        word = word.trim();
+        // skip blank lines
+        if (word.length() == 0) continue;

Review Comment:
   I don't want to change this here as this is unrelated.





[GitHub] [lucene] uschindler commented on pull request #12299: GITHUB-12291: Skip blank lines from stopwords list.

Posted by "uschindler (via GitHub)" <gi...@apache.org>.
uschindler commented on PR #12299:
URL: https://github.com/apache/lucene/pull/12299#issuecomment-1552603008

   Isn't it a bug fix? Originally we ended up with an empty stopword in the set.




[GitHub] [lucene] uschindler merged pull request #12299: GITHUB-12291: Skip blank lines from stopwords list.

Posted by "uschindler (via GitHub)" <gi...@apache.org>.
uschindler merged PR #12299:
URL: https://github.com/apache/lucene/pull/12299




[GitHub] [lucene] uschindler commented on a diff in pull request #12299: GITHUB-12291: Skip blank lines from stopwords list.

Posted by "uschindler (via GitHub)" <gi...@apache.org>.
uschindler commented on code in PR #12299:
URL: https://github.com/apache/lucene/pull/12299#discussion_r1196018225


##########
lucene/core/src/java/org/apache/lucene/analysis/WordlistLoader.java:
##########
@@ -117,7 +120,10 @@ public static CharArraySet getWordSet(Reader reader, String comment, CharArraySe
       String word = null;
       while ((word = br.readLine()) != null) {
         if (word.startsWith(comment) == false) {
-          result.add(word.trim());
+          word = word.trim();
+          // skip blank lines
+          if (word.length() == 0) continue;

Review Comment:
   Use `word.isEmpty()`.



##########
lucene/core/src/java/org/apache/lucene/analysis/WordlistLoader.java:
##########
@@ -53,7 +53,10 @@ public static CharArraySet getWordSet(Reader reader, CharArraySet result) throws
     try (BufferedReader br = getBufferedReader(reader)) {
       String word = null;
       while ((word = br.readLine()) != null) {
-        result.add(word.trim());
+        word = word.trim();
+        // skip blank lines
+        if (word.length() == 0) continue;

Review Comment:
   Use `word.isEmpty()`.
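
As a rough sketch of how the comment-aware overload could look with this suggestion applied: the code below mirrors the order of checks in the diff (comment prefix first, then trim, then the emptiness test). It is not the merged Lucene code. `CommentAwareWordlistSketch` and `loadWords` are hypothetical names, and a plain `java.util.Set` replaces `CharArraySet` to keep the example self-contained.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical stand-in for the overload that takes a comment prefix.
class CommentAwareWordlistSketch {

  // Skips lines starting with the comment prefix, then skips lines that are blank after trimming.
  public static Set<String> loadWords(Reader reader, String comment) throws IOException {
    Set<String> result = new LinkedHashSet<>();
    try (BufferedReader br = new BufferedReader(reader)) {
      String word;
      while ((word = br.readLine()) != null) {
        if (word.startsWith(comment)) {
          continue; // comment line (checked before trimming, as in the diff)
        }
        word = word.trim();
        if (word.isEmpty()) {
          continue; // blank line
        }
        result.add(word);
      }
    }
    return result;
  }

  public static void main(String[] args) throws IOException {
    Set<String> words = loadWords(new StringReader("# header\nfoo\n\n  bar  \n"), "#");
    System.out.println(words); // prints [foo, bar]
  }
}
```

For a `String`, `isEmpty()` is equivalent to `length() == 0`; the suggestion is about readability only.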


