You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/12/04 17:10:51 UTC

[GitHub] [arrow] Jedi18 opened a new pull request #11858: ARROW-14737: [C++][Dataset] Support URI-decoding partition keys

Jedi18 opened a new pull request #11858:
URL: https://github.com/apache/arrow/pull/11858


   Add option to support URI decoding of key in hive partitioning.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] houqp commented on a change in pull request #11858: ARROW-14737: [C++][Dataset] Support URI-decoding partition keys

Posted by GitBox <gi...@apache.org>.
houqp commented on a change in pull request #11858:
URL: https://github.com/apache/arrow/pull/11858#discussion_r764562077



##########
File path: cpp/src/arrow/dataset/partition.h
##########
@@ -106,6 +106,8 @@ struct ARROW_DS_EXPORT KeyValuePartitioningOptions {
   /// After splitting a path into components, decode the path components
   /// before parsing according to this scheme.
   SegmentEncoding segment_encoding = SegmentEncoding::Uri;
+  // Should the key be decoded according to the scheme above too
+  bool decode_key = false;

Review comment:
       @zijie0 has a minimal reproducible example at: https://github.com/delta-io/delta-rs/issues/495#issue-1051108818. I would also prefer we do decode by default. If this could lead to breaking changes, then it makes sense to make it configurable, otherwise, it would be good to remove the option entirely.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #11858: ARROW-14737: [C++][Dataset] Support URI-decoding partition keys

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #11858:
URL: https://github.com/apache/arrow/pull/11858#discussion_r763187133



##########
File path: cpp/src/arrow/dataset/partition.h
##########
@@ -106,6 +106,8 @@ struct ARROW_DS_EXPORT KeyValuePartitioningOptions {
   /// After splitting a path into components, decode the path components
   /// before parsing according to this scheme.
   SegmentEncoding segment_encoding = SegmentEncoding::Uri;
+  // Should the key be decoded according to the scheme above too
+  bool decode_key = false;

Review comment:
       Is there actually a reason to make this configurable? Is there a situation where we would want to leave special characters unescaped? @lidavidm 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] lidavidm commented on a change in pull request #11858: ARROW-14737: [C++][Dataset] Support URI-decoding partition keys

Posted by GitBox <gi...@apache.org>.
lidavidm commented on a change in pull request #11858:
URL: https://github.com/apache/arrow/pull/11858#discussion_r763220893



##########
File path: cpp/src/arrow/dataset/partition.h
##########
@@ -106,6 +106,8 @@ struct ARROW_DS_EXPORT KeyValuePartitioningOptions {
   /// After splitting a path into components, decode the path components
   /// before parsing according to this scheme.
   SegmentEncoding segment_encoding = SegmentEncoding::Uri;
+  // Should the key be decoded according to the scheme above too
+  bool decode_key = false;

Review comment:
       For reference, I tried to get PySpark to create a partition key with %-encoded paths but it doesn't seem to do the encoding on local file system.
   
   However, I think we can just apply the decoding unilaterally here.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Jedi18 commented on a change in pull request #11858: ARROW-14737: [C++][Dataset] Support URI-decoding partition keys

Posted by GitBox <gi...@apache.org>.
Jedi18 commented on a change in pull request #11858:
URL: https://github.com/apache/arrow/pull/11858#discussion_r763717896



##########
File path: cpp/src/arrow/dataset/partition.h
##########
@@ -106,6 +106,8 @@ struct ARROW_DS_EXPORT KeyValuePartitioningOptions {
   /// After splitting a path into components, decode the path components
   /// before parsing according to this scheme.
   SegmentEncoding segment_encoding = SegmentEncoding::Uri;
+  // Should the key be decoded according to the scheme above too
+  bool decode_key = false;

Review comment:
       Should I default decode_key to true or remove it entirely so that the key is always decoded?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ursabot edited a comment on pull request #11858: ARROW-14737: [C++][Dataset] Support URI-decoding partition keys

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #11858:
URL: https://github.com/apache/arrow/pull/11858#issuecomment-992882363


   Benchmark runs are scheduled for baseline = dee4ba31773e3ba854e91f8b5df1423a0c88f352 and contender = 11be9c542b9699b7eb4ae1656775c9b30639e415. 11be9c542b9699b7eb4ae1656775c9b30639e415 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/6ce07ac40ab1440d849beff3a82db30e...c44db1870bbf4b608ed92bacb301ecf2/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/51b244431a4a48ab954e76aa8d1bc634...c9dd8a17747f413a8fb6c0734c585bdb/)
   [Finished :arrow_down:0.62% :arrow_up:0.0%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/41a2ef22f9994aa0893b7c2658db4a07...3f5e418679cb4372ab08b307c241c0f9/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ursabot edited a comment on pull request #11858: ARROW-14737: [C++][Dataset] Support URI-decoding partition keys

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #11858:
URL: https://github.com/apache/arrow/pull/11858#issuecomment-992882363


   Benchmark runs are scheduled for baseline = dee4ba31773e3ba854e91f8b5df1423a0c88f352 and contender = 11be9c542b9699b7eb4ae1656775c9b30639e415. 11be9c542b9699b7eb4ae1656775c9b30639e415 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/6ce07ac40ab1440d849beff3a82db30e...c44db1870bbf4b608ed92bacb301ecf2/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/51b244431a4a48ab954e76aa8d1bc634...c9dd8a17747f413a8fb6c0734c585bdb/)
   [Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/41a2ef22f9994aa0893b7c2658db4a07...3f5e418679cb4372ab08b307c241c0f9/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] lidavidm commented on a change in pull request #11858: ARROW-14737: [C++][Dataset] Support URI-decoding partition keys

Posted by GitBox <gi...@apache.org>.
lidavidm commented on a change in pull request #11858:
URL: https://github.com/apache/arrow/pull/11858#discussion_r764853445



##########
File path: cpp/src/arrow/dataset/partition.h
##########
@@ -106,6 +106,8 @@ struct ARROW_DS_EXPORT KeyValuePartitioningOptions {
   /// After splitting a path into components, decode the path components
   /// before parsing according to this scheme.
   SegmentEncoding segment_encoding = SegmentEncoding::Uri;
+  // Should the key be decoded according to the scheme above too
+  bool decode_key = false;

Review comment:
       In that case I think let's remove the option entirely.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] lidavidm closed pull request #11858: ARROW-14737: [C++][Dataset] Support URI-decoding partition keys

Posted by GitBox <gi...@apache.org>.
lidavidm closed pull request #11858:
URL: https://github.com/apache/arrow/pull/11858


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ursabot edited a comment on pull request #11858: ARROW-14737: [C++][Dataset] Support URI-decoding partition keys

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #11858:
URL: https://github.com/apache/arrow/pull/11858#issuecomment-992882363


   Benchmark runs are scheduled for baseline = dee4ba31773e3ba854e91f8b5df1423a0c88f352 and contender = 11be9c542b9699b7eb4ae1656775c9b30639e415. 11be9c542b9699b7eb4ae1656775c9b30639e415 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/6ce07ac40ab1440d849beff3a82db30e...c44db1870bbf4b608ed92bacb301ecf2/)
   [Failed :arrow_down:0.0% :arrow_up:0.0%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/51b244431a4a48ab954e76aa8d1bc634...c9dd8a17747f413a8fb6c0734c585bdb/)
   [Finished :arrow_down:0.62% :arrow_up:0.0%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/41a2ef22f9994aa0893b7c2658db4a07...3f5e418679cb4372ab08b307c241c0f9/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Jedi18 commented on a change in pull request #11858: ARROW-14737: [C++][Dataset] Support URI-decoding partition keys

Posted by GitBox <gi...@apache.org>.
Jedi18 commented on a change in pull request #11858:
URL: https://github.com/apache/arrow/pull/11858#discussion_r767165717



##########
File path: cpp/src/arrow/dataset/partition.h
##########
@@ -106,6 +106,8 @@ struct ARROW_DS_EXPORT KeyValuePartitioningOptions {
   /// After splitting a path into components, decode the path components
   /// before parsing according to this scheme.
   SegmentEncoding segment_encoding = SegmentEncoding::Uri;
+  // Should the key be decoded according to the scheme above too
+  bool decode_key = false;

Review comment:
       Done, should be ready to be merged.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Jedi18 commented on a change in pull request #11858: ARROW-14737: [C++][Dataset] Support URI-decoding partition keys

Posted by GitBox <gi...@apache.org>.
Jedi18 commented on a change in pull request #11858:
URL: https://github.com/apache/arrow/pull/11858#discussion_r767165717



##########
File path: cpp/src/arrow/dataset/partition.h
##########
@@ -106,6 +106,8 @@ struct ARROW_DS_EXPORT KeyValuePartitioningOptions {
   /// After splitting a path into components, decode the path components
   /// before parsing according to this scheme.
   SegmentEncoding segment_encoding = SegmentEncoding::Uri;
+  // Should the key be decoded according to the scheme above too
+  bool decode_key = false;

Review comment:
       Done, should be ready to be merged.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] lidavidm commented on a change in pull request #11858: ARROW-14737: [C++][Dataset] Support URI-decoding partition keys

Posted by GitBox <gi...@apache.org>.
lidavidm commented on a change in pull request #11858:
URL: https://github.com/apache/arrow/pull/11858#discussion_r765802520



##########
File path: cpp/src/arrow/dataset/partition.cc
##########
@@ -580,19 +580,23 @@ Result<util::optional<KeyValuePartitioning::Key>> HivePartitioning::ParseKey(
   // Static method, so we have no better place for it
   util::InitializeUTF8();
 
-  auto name = segment.substr(0, name_end);
+  std::string name;
   std::string value;
   switch (options.segment_encoding) {
     case SegmentEncoding::None: {
+      name = segment.substr(0, name_end);
       value = segment.substr(name_end + 1);
-      if (ARROW_PREDICT_FALSE(!util::ValidateUTF8(value))) {
-        return Status::Invalid("Partition segment was not valid UTF-8: ", value);
+      if (ARROW_PREDICT_FALSE(!util::ValidateUTF8(value)) ||
+          ARROW_PREDICT_FALSE(!util::ValidateUTF8(name))) {
+        return Status::Invalid("Partition segment was not valid UTF-8: ", name, value);

Review comment:
       nit: this will jumble the name and value together, maybe just put `segment` instead of separate `name` and `value`?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] lidavidm commented on pull request #11858: ARROW-14737: [C++][Dataset] Support URI-decoding partition keys

Posted by GitBox <gi...@apache.org>.
lidavidm commented on pull request #11858:
URL: https://github.com/apache/arrow/pull/11858#issuecomment-992566504


   Thank you @Jedi18!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Jedi18 commented on a change in pull request #11858: ARROW-14737: [C++][Dataset] Support URI-decoding partition keys

Posted by GitBox <gi...@apache.org>.
Jedi18 commented on a change in pull request #11858:
URL: https://github.com/apache/arrow/pull/11858#discussion_r767822312



##########
File path: cpp/src/arrow/dataset/partition.cc
##########
@@ -580,19 +580,23 @@ Result<util::optional<KeyValuePartitioning::Key>> HivePartitioning::ParseKey(
   // Static method, so we have no better place for it
   util::InitializeUTF8();
 
-  auto name = segment.substr(0, name_end);
+  std::string name;
   std::string value;
   switch (options.segment_encoding) {
     case SegmentEncoding::None: {
+      name = segment.substr(0, name_end);
       value = segment.substr(name_end + 1);
-      if (ARROW_PREDICT_FALSE(!util::ValidateUTF8(value))) {
-        return Status::Invalid("Partition segment was not valid UTF-8: ", value);
+      if (ARROW_PREDICT_FALSE(!util::ValidateUTF8(value)) ||
+          ARROW_PREDICT_FALSE(!util::ValidateUTF8(name))) {
+        return Status::Invalid("Partition segment was not valid UTF-8: ", name, value);

Review comment:
       Ok done, thanks.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ursabot commented on pull request #11858: ARROW-14737: [C++][Dataset] Support URI-decoding partition keys

Posted by GitBox <gi...@apache.org>.
ursabot commented on pull request #11858:
URL: https://github.com/apache/arrow/pull/11858#issuecomment-992882363


   Benchmark runs are scheduled for baseline = dee4ba31773e3ba854e91f8b5df1423a0c88f352 and contender = 11be9c542b9699b7eb4ae1656775c9b30639e415. 11be9c542b9699b7eb4ae1656775c9b30639e415 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Scheduled] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/6ce07ac40ab1440d849beff3a82db30e...c44db1870bbf4b608ed92bacb301ecf2/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/51b244431a4a48ab954e76aa8d1bc634...c9dd8a17747f413a8fb6c0734c585bdb/)
   [Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/41a2ef22f9994aa0893b7c2658db4a07...3f5e418679cb4372ab08b307c241c0f9/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #11858: ARROW-14737: [C++][Dataset] Support URI-decoding partition keys

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #11858:
URL: https://github.com/apache/arrow/pull/11858#issuecomment-986060173


   https://issues.apache.org/jira/browse/ARROW-14737


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org