Posted to github@arrow.apache.org by "westonpace (via GitHub)" <gi...@apache.org> on 2023/04/10 16:41:14 UTC

[GitHub] [arrow] westonpace commented on issue #34905: [Python] unexpected URL encoded path (white spaces) when uploading to S3

westonpace commented on issue #34905:
URL: https://github.com/apache/arrow/issues/34905#issuecomment-1502040774

   This was introduced by the fix for https://github.com/apache/arrow/issues/33448.  It looks like we made a backwards-incompatible change here, which is unfortunate.
   
   > NOTICE: the Equal Sign = is URL encoded for the request, but won't become %3D on S3 filesystem. That means, the URL encoded equal sign = seems to be interpreted correctly
   
   I'm not sure it's relevant to my larger point, but I don't think the equal sign is actually encoded in the request:
   
   > \<Key>product=My%20Fancy%20Product/date=2023-01-10/a724b93c251a486b897eb7b151c622bd-0.parquet\</Key>
   
   Unfortunately, it is a tricky problem.  The encoding here is not there to support HTTP requests (in S3, all of these paths go into the HTTP body and are not part of the URI); instead, it addresses two different problems.
   
   First, we need to support the concept of hive partitioning.  In hive partitioning, the `=` and `/` characters carry special meaning because `{x:3, y:7}` gets encoded as `x=3/y=7`.  This caused issues when hive keys or hive values themselves contained `/` or `=`, so the solution was to encode the value (in retrospect, I suppose we should be encoding the keys as well).
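   
   To make the ambiguity concrete, here is a minimal Python sketch of hive-style path construction (not Arrow's actual code; `urllib.parse.quote` stands in for uriparser's encoder):
   
   ```python
   from urllib.parse import quote

   def hive_path(partitions):
       # Encode each value so that '=' or '/' inside a value cannot be
       # confused with the hive delimiters between key/value pairs.
       # Without encoding, a value like "b/c" would inject a bogus path segment.
       return "/".join(f"{k}={quote(str(v), safe='')}" for k, v in partitions.items())

   print(hive_path({"x": 3, "y": 7}))
   # x=3/y=7
   print(hive_path({"product": "My Fancy Product", "date": "2023-01-10"}))
   # product=My%20Fancy%20Product/date=2023-01-10
   ```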
   
   Second, most filesystems only support a restricted set of characters.  Note that even S3 [doesn't fully support spaces](https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-keys.html):
   
   > Space – Significant sequences of spaces might be lost in some uses (especially multiple spaces)
   
   To solve this problem we are now using uriparser's RFC 3986 encode function.  This is an imprecise approach: it converts more characters than are strictly needed in all cases.  However, there is some precedent for this (Spark), and I fear that anything narrower would be too complex and/or unintuitive.
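   
   For a sense of how broad this is, Python's `urllib.parse.quote` applies the same RFC 3986 rules (an analogue for illustration, not the uriparser call Arrow uses): only the unreserved characters (letters, digits, `-`, `.`, `_`, `~`) pass through untouched.
   
   ```python
   from urllib.parse import quote

   # Everything outside RFC 3986's unreserved set is percent-encoded,
   # even characters that S3 itself would accept in an object key.
   for value in ["My Fancy Product", "a+b", "100%", "café"]:
       print(value, "->", quote(value, safe=""))
   # My Fancy Product -> My%20Fancy%20Product
   # a+b -> a%2Bb
   # 100% -> 100%25
   # café -> caf%C3%A9
   ```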
   
   I'd support a PR to turn encoding on and off entirely (either as an argument to the partitioning object or as part of the write_dataset options).  The default could be on, and users could then choose to disable the feature.  Users would then be responsible for ensuring their partitioning values consist of legal characters for their filesystem.
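   
   As a rough sketch of what that could look like (the commented-out keyword below is hypothetical; the actual name and placement would be decided in the PR):
   
   ```python
   import pyarrow as pa
   import pyarrow.dataset as ds

   table = pa.table({
       "product": ["My Fancy Product"],
       "date": ["2023-01-10"],
       "value": [1],
   })

   part = ds.partitioning(
       pa.schema([("product", pa.string()), ("date", pa.string())]),
       flavor="hive",
   )
   ds.write_dataset(
       table, "s3://bucket/prefix", format="parquet",
       partitioning=part,
       # hypothetical option: disable URL-encoding of partition segments
       # encode_partition_segments=False,
   )
   ```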

