You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@ozone.apache.org by "Tejaskriya (via GitHub)" <gi...@apache.org> on 2023/09/12 08:40:48 UTC

[GitHub] [ozone] Tejaskriya opened a new pull request, #5276: HDDS-8920. Ozone is supporting unicode volume and bucket names, potentially unintentionally

Tejaskriya opened a new pull request, #5276:
URL: https://github.com/apache/ozone/pull/5276

   ## What changes were proposed in this pull request?
   
   Ozone was unintentionally supporting Unicode non-ascii alphabetic characters (like Æ, ｚ, etc.) for volume and bucket names as Character.lowercase method is unicode aware. The check is now done using regex to avoid this behaviour.
   
   ## What is the link to the Apache JIRA
   
   https://issues.apache.org/jira/browse/HDDS-8920
   
   ## How was this patch tested?
   
   Added a unit test in TestHddsClientUtils. Also tested manually in docker set-up:
   ```
   bash-4.2$ ozone sh volume create Áaa
   INVALID_VOLUME_NAME Bucket or Volume name has an unsupported character : ?
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] smengcl commented on a diff in pull request #5276: HDDS-8920. Ozone is supporting unicode volume and bucket names, potentially unintentionally

Posted by "smengcl (via GitHub)" <gi...@apache.org>.

smengcl commented on code in PR #5276:
URL: https://github.com/apache/ozone/pull/5276#discussion_r1323517345


##########
hadoop-hdds/client/src/main/java/org/apache/hadoop/hdds/scm/client/HddsClientUtils.java:
##########
@@ -132,9 +132,8 @@ private static boolean isSupportedCharacter(char c, boolean isStrictS3) {
     // ozone allows namespace to follow other volume/bucket naming convention,
     // for example, here supports '_',
     // which is a valid character in POSIX-compliant system, like HDFS.
-    return (c == '.' || c == '-' ||
-        Character.isLowerCase(c) || Character.isDigit(c)) ||
-        (c == '_' && !isStrictS3);
+    return (OzoneConsts.SUPPORTED_CHARACTER_CHECK_REGEX
+        .matcher(Character.toString(c)).matches()) || (c == '_' && !isStrictS3);

Review Comment:
   Since now we are running the regex over each character rather than the whole string. Could it be a lot slower?
   
   In theory, we could check the whole name string **once** with the regex (with an extra `+` at the end of course) in `verifyResourceName` to avoid checking **each** character one by one. But that would require some refactoring on the whole resource name checking logic. So we will need to avoid disrupting the functionality of the `isStrictS3` config.
   
   @Tejaskriya Can we do a quick unit test benchmark of the `isSupportedCharacter` method, e.g. run this method over a 1 million characters loop, and compare with the old checker, to see if it is indeed slower? And if so, by how much.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] sodonnel commented on a diff in pull request #5276: HDDS-8920. Ozone is supporting unicode volume and bucket names, potentially unintentionally

Posted by "sodonnel (via GitHub)" <gi...@apache.org>.

sodonnel commented on code in PR #5276:
URL: https://github.com/apache/ozone/pull/5276#discussion_r1324289293


##########
hadoop-hdds/client/src/main/java/org/apache/hadoop/hdds/scm/client/HddsClientUtils.java:
##########
@@ -132,9 +132,17 @@ private static boolean isSupportedCharacter(char c, boolean isStrictS3) {
     // ozone allows namespace to follow other volume/bucket naming convention,
     // for example, here supports '_',
     // which is a valid character in POSIX-compliant system, like HDFS.
-    return (c == '.' || c == '-' ||
-        Character.isLowerCase(c) || Character.isDigit(c)) ||
-        (c == '_' && !isStrictS3);
+    int code = c;
+    if (code >= 47 && code <= 57) { // 0 - 9

Review Comment:
   Well spotted on the ACSII code error. Using the actual characters is better, I agree. I wasn't aware you could do it like this.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] Tejaskriya commented on pull request #5276: HDDS-8920. Ozone is supporting unicode volume and bucket names, potentially unintentionally

Posted by "Tejaskriya (via GitHub)" <gi...@apache.org>.

Tejaskriya commented on PR #5276:
URL: https://github.com/apache/ozone/pull/5276#issuecomment-1717006453

   @sodonnel Thank you so much for benchmarking this! Following the results, I have changed the approach to make the checks with if-else statements using the ascii values. 
   @smengcl could you please verify if the changes are good to go now?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] adoroszlai commented on pull request #5276: HDDS-8920. Ozone is supporting unicode volume and bucket names, potentially unintentionally

Posted by "adoroszlai (via GitHub)" <gi...@apache.org>.

adoroszlai commented on PR #5276:
URL: https://github.com/apache/ozone/pull/5276#issuecomment-1716141798

   > I worry slightly about needing to use a (simple) regex on every character of a key
   
   I guess it's only needed to be able to show the offending character.  How about this:
   
   1. check the whole string against the regex
   2. only if it fails to match, find (all) offending characters
   
   This might reduce processing time for the happy path.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] smengcl merged pull request #5276: HDDS-8920. Ozone is supporting unicode volume and bucket names, unintentionally

Posted by "smengcl (via GitHub)" <gi...@apache.org>.

smengcl merged PR #5276:
URL: https://github.com/apache/ozone/pull/5276


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] smengcl commented on a diff in pull request #5276: HDDS-8920. Ozone is supporting unicode volume and bucket names, potentially unintentionally

Posted by "smengcl (via GitHub)" <gi...@apache.org>.

smengcl commented on code in PR #5276:
URL: https://github.com/apache/ozone/pull/5276#discussion_r1323517345


##########
hadoop-hdds/client/src/main/java/org/apache/hadoop/hdds/scm/client/HddsClientUtils.java:
##########
@@ -132,9 +132,8 @@ private static boolean isSupportedCharacter(char c, boolean isStrictS3) {
     // ozone allows namespace to follow other volume/bucket naming convention,
     // for example, here supports '_',
     // which is a valid character in POSIX-compliant system, like HDFS.
-    return (c == '.' || c == '-' ||
-        Character.isLowerCase(c) || Character.isDigit(c)) ||
-        (c == '_' && !isStrictS3);
+    return (OzoneConsts.SUPPORTED_CHARACTER_CHECK_REGEX
+        .matcher(Character.toString(c)).matches()) || (c == '_' && !isStrictS3);

Review Comment:
   As Stephen already mentioned [above](https://github.com/apache/ozone/pull/5276#issuecomment-1715564621), since now we are running the regex over each character rather than the whole string. Could it be a lot slower?
   
   In theory, we could check the whole name string **once** with the regex (with an extra `+` at the end of course) in `verifyResourceName` to avoid checking **each** character one by one. But that would require some refactoring on the whole resource name checking logic. So we will need to avoid disrupting the functionality of the `isStrictS3` config.
   
   @Tejaskriya Can we do a quick unit test benchmark of the `isSupportedCharacter` method, e.g. run this method over a 1 million characters loop, and compare with the old checker, to see if it is indeed slower? And if so, by how much.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] smengcl commented on a diff in pull request #5276: HDDS-8920. Ozone is supporting unicode volume and bucket names, potentially unintentionally

Posted by "smengcl (via GitHub)" <gi...@apache.org>.

smengcl commented on code in PR #5276:
URL: https://github.com/apache/ozone/pull/5276#discussion_r1323517345


##########
hadoop-hdds/client/src/main/java/org/apache/hadoop/hdds/scm/client/HddsClientUtils.java:
##########
@@ -132,9 +132,8 @@ private static boolean isSupportedCharacter(char c, boolean isStrictS3) {
     // ozone allows namespace to follow other volume/bucket naming convention,
     // for example, here supports '_',
     // which is a valid character in POSIX-compliant system, like HDFS.
-    return (c == '.' || c == '-' ||
-        Character.isLowerCase(c) || Character.isDigit(c)) ||
-        (c == '_' && !isStrictS3);
+    return (OzoneConsts.SUPPORTED_CHARACTER_CHECK_REGEX
+        .matcher(Character.toString(c)).matches()) || (c == '_' && !isStrictS3);

Review Comment:
   Since now we are now running the regex over each character rather than a whole string. Could it be a lot slower?
   
   In theory, we could check the whole name string **once** with the regex (with an extra `+` at the end of course) in `verifyResourceName` to avoid checking **each** character one by one. But that would require some refactoring on the whole resource name checking logic, and need to avoid disrupting the functionality of the `isStrictS3` config.
   
   @Tejaskriya Can we do a quick unit test benchmark of the `isSupportedCharacter` method, e.g. run this method over a 1 million characters loop, and compare with the old checker, to see if it is indeed slower? And if so, by how much.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] Tejaskriya commented on a diff in pull request #5276: HDDS-8920. Ozone is supporting unicode volume and bucket names, potentially unintentionally

Posted by "Tejaskriya (via GitHub)" <gi...@apache.org>.

Tejaskriya commented on code in PR #5276:
URL: https://github.com/apache/ozone/pull/5276#discussion_r1324334758


##########
hadoop-hdds/client/src/main/java/org/apache/hadoop/hdds/scm/client/HddsClientUtils.java:
##########
@@ -132,9 +132,17 @@ private static boolean isSupportedCharacter(char c, boolean isStrictS3) {
     // ozone allows namespace to follow other volume/bucket naming convention,
     // for example, here supports '_',
     // which is a valid character in POSIX-compliant system, like HDFS.
-    return (c == '.' || c == '-' ||
-        Character.isLowerCase(c) || Character.isDigit(c)) ||
-        (c == '_' && !isStrictS3);
+    int code = c;
+    if (code >= 47 && code <= 57) { // 0 - 9

Review Comment:
   Thank you for the suggestion, I've changed it to character comparison now



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] sodonnel commented on a diff in pull request #5276: HDDS-8920. Ozone is supporting unicode volume and bucket names, potentially unintentionally

Posted by "sodonnel (via GitHub)" <gi...@apache.org>.

sodonnel commented on code in PR #5276:
URL: https://github.com/apache/ozone/pull/5276#discussion_r1322884638


##########
hadoop-hdds/common/src/main/java/org/apache/hadoop/ozone/OzoneConsts.java:
##########
@@ -434,6 +434,10 @@ private OzoneConsts() {
   public static final Pattern KEYNAME_ILLEGAL_CHARACTER_CHECK_REGEX  =
           Pattern.compile("^[^\\\\{}<>^%~#|`\\[\\]\"\\x80-\\xff]+$");
 
+  // Supported characters for resource names when isStrictS3 is set as false
+  public static final Pattern SUPPORTED_CHARACTER_CHECK_REGEX  =
+      Pattern.compile("[a-z0-9.-]");

Review Comment:
   I was suspicious about the final dash ( "-" )  character in the regex as the dash character is used for ranges inside the character classes, and I thought it needed escaped. However I tested a few regexes and it seems to work fine, and if you have an invalid range the regex fails to compile. I am just leaving this comment here incase someone else thinks to check this.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] smengcl commented on pull request #5276: HDDS-8920. Ozone is supporting unicode volume and bucket names, unintentionally

Posted by "smengcl (via GitHub)" <gi...@apache.org>.

smengcl commented on PR #5276:
URL: https://github.com/apache/ozone/pull/5276#issuecomment-1718354416

   Thanks @Tejaskriya for the patch. Thanks @sodonnel and @adoroszlai for the reviews and suggestions.
   
   Glad that we have a clean and performant solution here :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] smengcl commented on a diff in pull request #5276: HDDS-8920. Ozone is supporting unicode volume and bucket names, potentially unintentionally

Posted by "smengcl (via GitHub)" <gi...@apache.org>.

smengcl commented on code in PR #5276:
URL: https://github.com/apache/ozone/pull/5276#discussion_r1323515821


##########
hadoop-hdds/common/src/main/java/org/apache/hadoop/ozone/OzoneConsts.java:
##########
@@ -434,6 +434,10 @@ private OzoneConsts() {
   public static final Pattern KEYNAME_ILLEGAL_CHARACTER_CHECK_REGEX  =
           Pattern.compile("^[^\\\\{}<>^%~#|`\\[\\]\"\\x80-\\xff]+$");
 
+  // Supported characters for resource names when isStrictS3 is set as false
+  public static final Pattern SUPPORTED_CHARACTER_CHECK_REGEX  =
+      Pattern.compile("[a-z0-9.-]");

Review Comment:
   Indeed.
   
   In theory we could check the whole name string **once** with the regex (with an extra `+` at the end of course) in `verifyResourceName` to avoid checking **each** character one by one. But that would require some refactoring on the whole resource name checking logic, and need to avoid disrupting the functionality of the `isStrictS3` config.
   
   @Tejaskriya Can we do a quick unit test benchmark of the `isSupportedCharacter` method, e.g. run it over a 1 million characters and compare with the old checker, to see if it is indeed slower? And if so, by how much.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] smengcl commented on a diff in pull request #5276: HDDS-8920. Ozone is supporting unicode volume and bucket names, potentially unintentionally

Posted by "smengcl (via GitHub)" <gi...@apache.org>.

smengcl commented on code in PR #5276:
URL: https://github.com/apache/ozone/pull/5276#discussion_r1323517345


##########
hadoop-hdds/client/src/main/java/org/apache/hadoop/hdds/scm/client/HddsClientUtils.java:
##########
@@ -132,9 +132,8 @@ private static boolean isSupportedCharacter(char c, boolean isStrictS3) {
     // ozone allows namespace to follow other volume/bucket naming convention,
     // for example, here supports '_',
     // which is a valid character in POSIX-compliant system, like HDFS.
-    return (c == '.' || c == '-' ||
-        Character.isLowerCase(c) || Character.isDigit(c)) ||
-        (c == '_' && !isStrictS3);
+    return (OzoneConsts.SUPPORTED_CHARACTER_CHECK_REGEX
+        .matcher(Character.toString(c)).matches()) || (c == '_' && !isStrictS3);

Review Comment:
   As Stephen already mentioned [above](https://github.com/apache/ozone/pull/5276#issuecomment-1715564621), since we are running the regex over each character rather than the whole string in one shot. Could it be a lot slower?
   
   In theory, we could check the whole name string **once** with the regex (with an extra `+` at the end of course) in `verifyResourceName` to avoid checking **each** character one by one. But that would require some refactoring on the whole resource name checking logic. So we will need to avoid disrupting the functionality of the `isStrictS3` config.
   
   @Tejaskriya Can we do a quick unit test benchmark of the `isSupportedCharacter` method, e.g. run this method over a 1 million characters loop, and compare with the old checker, to see if it is indeed slower? And if so, by how much.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] smengcl commented on a diff in pull request #5276: HDDS-8920. Ozone is supporting unicode volume and bucket names, potentially unintentionally

Posted by "smengcl (via GitHub)" <gi...@apache.org>.

smengcl commented on code in PR #5276:
URL: https://github.com/apache/ozone/pull/5276#discussion_r1323517345


##########
hadoop-hdds/client/src/main/java/org/apache/hadoop/hdds/scm/client/HddsClientUtils.java:
##########
@@ -132,9 +132,8 @@ private static boolean isSupportedCharacter(char c, boolean isStrictS3) {
     // ozone allows namespace to follow other volume/bucket naming convention,
     // for example, here supports '_',
     // which is a valid character in POSIX-compliant system, like HDFS.
-    return (c == '.' || c == '-' ||
-        Character.isLowerCase(c) || Character.isDigit(c)) ||
-        (c == '_' && !isStrictS3);
+    return (OzoneConsts.SUPPORTED_CHARACTER_CHECK_REGEX
+        .matcher(Character.toString(c)).matches()) || (c == '_' && !isStrictS3);

Review Comment:
   Since now we are running the regex over each character rather than the whole string. Could it be a lot slower?
   
   In theory, we could check the whole name string **once** with the regex (with an extra `+` at the end of course) in `verifyResourceName` to avoid checking **each** character one by one. But that would require some refactoring on the whole resource name checking logic, and need to avoid disrupting the functionality of the `isStrictS3` config.
   
   @Tejaskriya Can we do a quick unit test benchmark of the `isSupportedCharacter` method, e.g. run this method over a 1 million characters loop, and compare with the old checker, to see if it is indeed slower? And if so, by how much.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] smengcl commented on a diff in pull request #5276: HDDS-8920. Ozone is supporting unicode volume and bucket names, potentially unintentionally

Posted by "smengcl (via GitHub)" <gi...@apache.org>.

smengcl commented on code in PR #5276:
URL: https://github.com/apache/ozone/pull/5276#discussion_r1324193653


##########
hadoop-hdds/client/src/main/java/org/apache/hadoop/hdds/scm/client/HddsClientUtils.java:
##########
@@ -132,9 +132,17 @@ private static boolean isSupportedCharacter(char c, boolean isStrictS3) {
     // ozone allows namespace to follow other volume/bucket naming convention,
     // for example, here supports '_',
     // which is a valid character in POSIX-compliant system, like HDFS.
-    return (c == '.' || c == '-' ||
-        Character.isLowerCase(c) || Character.isDigit(c)) ||
-        (c == '_' && !isStrictS3);
+    int code = c;
+    if (code >= 47 && code <= 57) { // 0 - 9

Review Comment:
   Why not just use `char` here? Also makes the code more readable:
   
   ```suggestion
       if (c >= '0' && c <= '9') {
   ```
   
   Same applies to two other conditions below.
   
   (Also, `0` is decimal 48 in the [ASCII table](https://www.asciitable.com/), not 47.)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] sodonnel commented on pull request #5276: HDDS-8920. Ozone is supporting unicode volume and bucket names, potentially unintentionally

Posted by "sodonnel (via GitHub)" <gi...@apache.org>.

sodonnel commented on PR #5276:
URL: https://github.com/apache/ozone/pull/5276#issuecomment-1715564621

   This change looks correct to me, but I worry slightly about needing to use a (simple) regex on every character of a key. Its going to churn a lot of small objects on a hot path (key creation) as each match needs to create a new matcher object.
   
   The valid characters are only a-z, 0 - 9, . and -, so we have 38 characters. 
   
   Could we have a method that checks the ascii range of the character, eg:
   
   ```
   int code = (int) character;
   if (code >= 47 && code <= 57) { // 0 - 9 
     return true;
   } else if (code >= 97 && code <= 122) {  // a-z
     return true;
   } else if (code == 45) { // "-"
     return true;
   } else if (code == 46) { // "."
     return true;
   } else {
     return false;
   }
   ```
   
   I honestly don't know if this method is better than the regex method performance wise. The regex solution is certainly nicer code, but it would be interesting to try a test using the micro-benchmark framework (JMH) to see how big the performance difference is for strings of a reasonable length (eg 100 - 500 characters or so).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] smengcl commented on a diff in pull request #5276: HDDS-8920. Ozone is supporting unicode volume and bucket names, potentially unintentionally

Posted by "smengcl (via GitHub)" <gi...@apache.org>.

smengcl commented on code in PR #5276:
URL: https://github.com/apache/ozone/pull/5276#discussion_r1323515821


##########
hadoop-hdds/common/src/main/java/org/apache/hadoop/ozone/OzoneConsts.java:
##########
@@ -434,6 +434,10 @@ private OzoneConsts() {
   public static final Pattern KEYNAME_ILLEGAL_CHARACTER_CHECK_REGEX  =
           Pattern.compile("^[^\\\\{}<>^%~#|`\\[\\]\"\\x80-\\xff]+$");
 
+  // Supported characters for resource names when isStrictS3 is set as false
+  public static final Pattern SUPPORTED_CHARACTER_CHECK_REGEX  =
+      Pattern.compile("[a-z0-9.-]");

Review Comment:
   Indeed.
   
   In theory we could check the whole name string **once** with the regex (with an extra `+` at the end of course) in `verifyResourceName` to avoid checking **each** character one by one. But that would require some refactoring on the whole resource name checking logic, and need to avoid disrupting the functionality of the `isStrictS3` config.
   
   @Tejaskriya Can we do a quick unit test benchmark of the `isSupportedCharacter` method, e.g. run it over a 1 million characters and compare with the old checker, to see if it is indeed slower? And if so, by how much.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] sodonnel commented on pull request #5276: HDDS-8920. Ozone is supporting unicode volume and bucket names, potentially unintentionally

Posted by "sodonnel (via GitHub)" <gi...@apache.org>.

sodonnel commented on PR #5276:
URL: https://github.com/apache/ozone/pull/5276#issuecomment-1716486117

   I decided to have a go at benchmarking this with JMH, using some code like in this PR, using a single Regex against the entire string and with the series of IF statements. The code looks like:
   
   ```
   package com.sodonnell.microbench;
   
   import org.openjdk.jmh.annotations.Benchmark;
   import org.openjdk.jmh.annotations.BenchmarkMode;
   import org.openjdk.jmh.annotations.Fork;
   import org.openjdk.jmh.annotations.Measurement;
   import org.openjdk.jmh.annotations.Mode;
   import org.openjdk.jmh.annotations.Warmup;
   
   import java.util.regex.Pattern;
   
   import static java.util.concurrent.TimeUnit.MILLISECONDS;
   
   public class Strings {
   
     private final static Pattern pattern = Pattern.compile("[a-z0-9.-]");
     private final static Pattern patternAll = Pattern.compile("^[a-z0-9.-]+$");
   
     private static String str = "hellothisisastringthatiwilltestoutforkeysetc123456789hellothisisastringthatiwilltestoutforkeysetc123456789";
   
   
     public static void main(String[] args) throws Exception {
       String[] opts = new String[2];
       opts[0] = "-prof";
       opts[1] = "gc";
   
       org.openjdk.jmh.Main.main(opts);
     }
   
     @Benchmark
     @Warmup(iterations = 1)
     @Fork(value = 1, warmups = 1)
     @Measurement(iterations = 3, time = 5000, timeUnit = MILLISECONDS)
     @BenchmarkMode(Mode.Throughput)
     public boolean regexp() {
       for (int i = 0; i < str.length(); i++) {
         if (!pattern.matcher(Character.toString(str.charAt(i))).matches()) {
           return false;
         }
       }
       return true;
     }
   
     @Benchmark
     @Warmup(iterations = 1)
     @Fork(value = 1, warmups = 1)
     @Measurement(iterations = 3, time = 5000, timeUnit = MILLISECONDS)
     @BenchmarkMode(Mode.Throughput)
     public boolean regexpFull() {
       return patternAll.matcher(str).matches();
     }
   
     @Benchmark
     @Warmup(iterations = 1)
     @Fork(value = 1, warmups = 1)
     @Measurement(iterations = 3, time = 5000, timeUnit = MILLISECONDS)
     @BenchmarkMode(Mode.Throughput)
     public boolean table() {
       for (int i = 0; i < str.length(); i++) {
         if (!isValid(str.charAt(i))) {
           return false;
         }
       }
       return true;
     }
   
     private boolean isValid(int code) {
       if (code >= 47 && code <= 57) { // 0 - 9
         return true;
       } else if (code >= 97 && code <= 122) {  // a-z
         return true;
       } else if (code == 45) { // "-"
         return true;
       } else if (code == 46) { // "."
         return true;
       } else {
         return false;
       }
     }
   
   }
   ```
   
   The results look like:
   
   ```
   Benchmark                                             Mode  Cnt         Score        Error   Units
   Strings.regexp                                       thrpt    3    418466.343 ± 257782.725   ops/s
   Strings.regexp:·gc.alloc.rate                        thrpt    3      6755.787 ±   4084.420  MB/sec
   Strings.regexp:·gc.alloc.rate.norm                   thrpt    3     18656.000 ±      0.001    B/op
   Strings.regexp:·gc.churn.G1_Eden_Space               thrpt    3      6754.113 ±   2631.464  MB/sec
   Strings.regexp:·gc.churn.G1_Eden_Space.norm          thrpt    3     18656.407 ±   4263.512    B/op
   Strings.regexp:·gc.count                             thrpt    3       150.000               counts
   Strings.regexp:·gc.time                              thrpt    3       124.000                   ms
   
   Strings.regexpFull                                   thrpt    3   3012970.853 ± 504540.390   ops/s
   Strings.regexpFull:·gc.alloc.rate                    thrpt    3       521.652 ±     85.555  MB/sec
   Strings.regexpFull:·gc.alloc.rate.norm               thrpt    3       200.000 ±      0.001    B/op
   Strings.regexpFull:·gc.churn.G1_Eden_Space           thrpt    3       530.936 ±    555.269  MB/sec
   Strings.regexpFull:·gc.churn.G1_Eden_Space.norm      thrpt    3       203.500 ±    180.451    B/op
   Strings.regexpFull:·gc.count                         thrpt    3        29.000               counts
   Strings.regexpFull:·gc.time                          thrpt    3        44.000                   ms
   
   Strings.table                                        thrpt    3  18801601.921 ± 852611.165   ops/s
   Strings.table:·gc.alloc.rate                         thrpt    3        ≈ 10⁻⁴               MB/sec
   Strings.table:·gc.alloc.rate.norm                    thrpt    3        ≈ 10⁻⁵                 B/op
   Strings.table:·gc.count                              thrpt    3           ≈ 0               counts
   
   Process finished with exit code 0
   ```
   
   Notice the "if statements" (Strings.table here) is about 18.8M ops per second vs 3M ops per seconds for a single regex and down to about 400k per second for the technique in this PR.
   
   Also notable is the GC churn created by the two regex approaches, with virtually no GC from the IF statements, which should help the throughput of the JVM.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org