You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by "zhipeng93 (via GitHub)" <gi...@apache.org> on 2023/03/07 03:20:29 UTC

[GitHub] [flink-ml] zhipeng93 opened a new pull request, #221: [FLINK-31189] Add HasMaxBucketNum param to StringIndexer

zhipeng93 opened a new pull request, #221:
URL: https://github.com/apache/flink-ml/pull/221

   ## What is the purpose of the change
   
   This PR aims to support the use case when users want to filter out infrequent strings in StringIndexer.
   To achieve this, I have added a `HasMaxBucketNum` param. The default value is `Int.MAX` so it is consistent with the existing behavior.
   
   Note that `HasMaxBucketNum` only works when `stringOrderType` is set as `frequencyDesc`.
   
   ## Brief change log
     - Added `HasMaxBucketNum` param and the corresponding logic.
     - Added java/python test
     - Added java doc.
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): (no)
     - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (no)
   
   ## Documentation
   
     - Does this pull request introduce a new feature? (yes)
     - If yes, how is the feature documented? (avaDocs)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [flink-ml] zhipeng93 merged pull request #221: [FLINK-31189] Add HasMaxIndexNum param to StringIndexer

Posted by "zhipeng93 (via GitHub)" <gi...@apache.org>.
zhipeng93 merged PR #221:
URL: https://github.com/apache/flink-ml/pull/221


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [flink-ml] zhipeng93 commented on pull request #221: [FLINK-31189] Add HasMaxIndexNum param to StringIndexer

Posted by "zhipeng93 (via GitHub)" <gi...@apache.org>.
zhipeng93 commented on PR #221:
URL: https://github.com/apache/flink-ml/pull/221#issuecomment-1491627983

   > Thanks for the PR. LGTM overall.
   > 
   > My only concern is that the term `bucket` is not explicitly introduced before introducing the parameter. `Bucket` appears to be an uncommon term in the context of `StringIndexer`.
   > 
   > To enhance clarity, perhaps we could include a brief description to introduce the concept of `bucket`, or consider renaming it with a more commonly used term, like `index`?
   
   Thanks for the review. I have update the param name to `maxIndexNum`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [flink-ml] Fanoid commented on a diff in pull request #221: [FLINK-31189] Add HasMaxIndexNum param to StringIndexer

Posted by "Fanoid (via GitHub)" <gi...@apache.org>.
Fanoid commented on code in PR #221:
URL: https://github.com/apache/flink-ml/pull/221#discussion_r1154268534


##########
docs/content/docs/operators/feature/stringindexer.md:
##########
@@ -59,9 +59,10 @@ Below are the parameters required by `StringIndexerModel`.
 
 `StringIndexer` needs parameters above and also below.
 
-| Key             | Default       | Type   | Required | Description                                                                                                                         |
-|-----------------|---------------|--------|----------|-------------------------------------------------------------------------------------------------------------------------------------|
-| stringOrderType | `"arbitrary"` | String | no       | How to order strings of each column. Supported values: 'arbitrary', 'frequencyDesc', 'frequencyAsc', 'alphabetDesc', 'alphabetAsc'. |
+| Key             | Default       | Type    | Required | Description                                                                                                                         |
+|-----------------|---------------|---------|----------|-------------------------------------------------------------------------------------------------------------------------------------|
+| stringOrderType | `"arbitrary"` | String  | no       | How to order strings of each column. Supported values: 'arbitrary', 'frequencyDesc', 'frequencyAsc', 'alphabetDesc', 'alphabetAsc'. |
+| MaxIndexNum     | `2147483647`  | Integer | no       | The max number of indices for each column. It only works when stringOrderType is set as frequencyDesc.                              |

Review Comment:
   nit: add quotes for 'stringOrderType' and 'frequencyDesc', same as other places.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [flink-ml] Fanoid commented on pull request #221: [FLINK-31189] Add HasMaxBucketNum param to StringIndexer

Posted by "Fanoid (via GitHub)" <gi...@apache.org>.
Fanoid commented on PR #221:
URL: https://github.com/apache/flink-ml/pull/221#issuecomment-1463476061

   Thanks for the PR. LGTM overall.
   
   My only concern is that the term `bucket` is not explicitly introduced before introducing the parameter. `Bucket` appears to be an uncommon term in the context of `StringIndexer`. 
   
   To enhance clarity, perhaps we could include a brief description to introduce the concept of `bucket`, or consider renaming it with a more commonly used term?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [flink-ml] zhipeng93 commented on pull request #221: [FLINK-31189] Add HasMaxBucketNum param to StringIndexer

Posted by "zhipeng93 (via GitHub)" <gi...@apache.org>.
zhipeng93 commented on PR #221:
URL: https://github.com/apache/flink-ml/pull/221#issuecomment-1457461065

   @Fanoid @lindong28 Can you help to review this PR?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [flink-ml] lindong28 commented on pull request #221: [FLINK-31189] Add HasMaxIndexNum param to StringIndexer

Posted by "lindong28 (via GitHub)" <gi...@apache.org>.
lindong28 commented on PR #221:
URL: https://github.com/apache/flink-ml/pull/221#issuecomment-1492803514

   Thanks for the PR. LGTM. Please feel free to merge this PR after resolving the above comment.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org