Posted to issues@arrow.apache.org by "svenatarms (via GitHub)" <gi...@apache.org> on 2023/04/05 08:22:18 UTC

[GitHub] [arrow] svenatarms opened a new issue, #34905: [Python] unexpected URL encoded path (white spaces) when uploading to S3

svenatarms opened a new issue, #34905:
URL: https://github.com/apache/arrow/issues/34905

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   #### Environment
   
   OS: Windows/Linux
   Python: 3.10.10
   s3fs: from 2022.7.1 to 2023.3.0 (doesn't matter)
   S3 Backend: MinIO / Ceph (doesn't matter)
   
   #### Description
   
   pyarrow 11.0.0 introduced unexpected behavior when uploading Parquet files to an S3 bucket (using s3fs.S3FileSystem): if the path to the Parquet file contains white space, the spaces are replaced by their URL-encoded form %20. For example, a directory name like
   
   > product=My Fancy Product
   
   becomes
   
   > product=My%20Fancy%20Product
   
   on the S3 filesystem. **NOTICE**: the equal sign `=` is URL encoded in the request, but does not end up as %3D on the S3 filesystem. In other words, the URL-encoded equal sign `=` is decoded back as expected, while the encoded space is not.
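   
   For illustration only (this snippet is not part of the original report): the key that ends up on the S3 filesystem matches what Python's standard library produces when percent-encoding the partition value while treating `=` and `/` as safe separators:
   
   ```python
   from urllib.parse import quote
   
   # Percent-encode a hive-style partition path, keeping '=' and '/' as literal separators.
   path = "product=My Fancy Product/date=2023-01-05"
   print(quote(path, safe="=/"))
   # -> product=My%20Fancy%20Product/date=2023-01-05
   ```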
   
   #### Example Code
   ```python
   import aiohttp
   import s3fs
   import pyarrow.parquet as pq
   from pyarrow import Table
   from pyarrow.lib import ArrowTypeError
   
   # delete_if_exists, InvalidDataFrame and S3ConnectionError are application-specific
   # helpers/exceptions that are not shown here.
   
   
   # s3fs FileSystem object
   def return_s3filesystem(url, user, pw):
       fs = s3fs.S3FileSystem(
           anon=False,
           use_ssl=True,
           client_kwargs={
               "endpoint_url": url,
               "aws_access_key_id": user,
               "aws_secret_access_key": pw,
               "verify": False,
           }
       )
       return fs
   
   
   def write_df_to_s3(df, partition_cols, path_to_s3_object, url, user, pw, more_than_one_date_per_file,
                      delete_parquet_files):
       """
       Write a Parquet file from a pandas DataFrame to an S3 bucket.
       """
       # instantiate the s3fs.S3FileSystem object
       fs = return_s3filesystem(url, user, pw)
       # if the Parquet file already exists, delete it if requested, to prevent duplicated data
       delete_if_exists(fs, path_to_s3_object, df, more_than_one_date_per_file, delete_existing_files=delete_parquet_files)
       try:
           # create an Arrow Table from the DataFrame
           arrow_table = Table.from_pandas(df)
       except ArrowTypeError as e:
           # this is error no. 1626701451158
           raise InvalidDataFrame(errorno=1626701451158, dataframe=df, arrowexception=e)
       except TypeError as e:
           raise InvalidDataFrame(errorno=1627657641211, dataframe=df, arrowexception=e)
       try:
           # write the Parquet file to the S3 bucket, using the S3FileSystem object 'fs' from above;
           # directories are created according to partition_cols
           pq.write_to_dataset(arrow_table,
                               path_to_s3_object,
                               partition_cols=partition_cols,
                               filesystem=fs,
                               use_dictionary=False,
                               data_page_size=100000,
                               compression="snappy",
                               version="2.0")
       except ArrowTypeError as e:
           raise InvalidDataFrame(errorno=1627575189, dataframe=df, arrowexception=e)
       except aiohttp.client_exceptions.ClientConnectionError as e:
           raise S3ConnectionError(errorno=1627575130, exmsg=e)
   ```
   
   #### Example Result
   ##### Expected Result (using pyarrow 10.0.1)
   ![image](https://user-images.githubusercontent.com/122540156/230012242-0076a82a-f985-495f-a2a4-851928f42a3d.png)
   
   ###### Debug output
   ```
   botocore.endpoint - DEBUG - Sending http request: <AWSPreparedRequest stream_output=False, method=POST, url=http://localhost:9000/my-products/product%3DMy%20Fancy%20Product/date%3D2023-01-05/0d5d1f2c503247
   2dbad1d17c845d5432-0.parquet?uploadId=NDBhYjllZDEtNWIxOC00ZTBlLWI4ODYtOGRhZjBhNTg3NzQ5LjYxNTFhMDBlLTQxMmQtNDQ5Ni05YjBjLTBiMGM3ODI3MzhkMg, headers={'User-Agent': b'Botocore/1.27.59 Python/3.10.10 Windows/10
   ', 'X-Amz-Date': b'20230405T073129Z', 'X-Amz-Content-SHA256': b'41dccb632a0540f4f83eaf7138f97c5dd63c09410cbc3aa3412963b2f7006f18', 'Authorization': b'AWS4-HMAC-SHA256 Credential=******/*******/us-east
   -1/s3/aws4_request, SignedHeaders=host;x-amz-content-sha256;x-amz-date, Signature=7d89128993d6a226d3ac4fa3e6adbb60f638f28c37265446284ca6d629c837f8', 'amz-sdk-invocation-id': b'832bc91b-c285-4413-ad3d-546a3
   bcefb59', 'amz-sdk-request': b'attempt=1', 'Content-Length': '357'}>
   botocore.parsers - DEBUG - Response headers: HTTPHeaderDict({'accept-ranges': 'bytes', 'cache-control': 'no-cache', 'content-length': '471', 'content-security-policy': 'block-all-mixed-content', 'content-t
   ype': 'application/xml', 'etag': '"caca775951f07ca64f530aae539fe5cd-3"', 'server': 'MinIO', 'strict-transport-security': 'max-age=31536000; includeSubDomains', 'vary': 'Accept-Encoding', 'x-accel-buffering
   ': 'no', 'x-amz-id-2': 'e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855', 'x-amz-request-id': '1752F9747004F3E5', 'x-content-type-options': 'nosniff', 'x-xss-protection': '1; mode=block', 
   'date': 'Wed, 05 Apr 2023 07:31:29 GMT'})
   botocore.parsers - DEBUG - Response body:
   b'<?xml version="1.0" encoding="UTF-8"?>\n<CompleteMultipartUploadResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/"><Location>http://localhost:9000/my-products/product=My%20Fancy%20Product/date=2023-0
   1-05/0d5d1f2c5032472dbad1d17c845d5432-0.parquet</Location><Bucket>my-products</Bucket><Key>product=My Fancy Product/date=2023-01-05/0d5d1f2c5032472dbad1d17c845d5432-0.parquet</Key><ETag>&#34;caca775951f07c
   a64f530aae539fe5cd-3&#34;</ETag></CompleteMultipartUploadResult>'
   botocore.hooks - DEBUG - Event needs-retry.s3.CompleteMultipartUpload: calling handler <function check_for_200_error at 0x0000023C49ACA3B0>
   botocore.hooks - DEBUG - Event needs-retry.s3.CompleteMultipartUpload: calling handler <aiobotocore.retryhandler.AioRetryHandler object at 0x0000023C4D8B64D0>
   botocore.retryhandler - DEBUG - No retry needed.
   botocore.hooks - DEBUG - Event needs-retry.s3.CompleteMultipartUpload: calling handler <bound method AioS3RegionRedirector.redirect_from_error of <aiobotocore.utils.AioS3RegionRedirector object at 0x000002
   3C4D8B6590>>
   
   ```
   
   ##### Actual result (using pyarrow 11.0.0)
   ![image](https://user-images.githubusercontent.com/122540156/230014637-03f361dd-7a9b-4309-9a8e-9ad106c4ca23.png)
   
   ###### Debug output
   ```
   botocore.endpoint - DEBUG - Sending http request: <AWSPreparedRequest stream_output=False, method=POST, url=http://localhost:9000/my-products/product%3DMy%2520Fancy%2520Product/date%3D2023-01-10/a724b93c25
   1a486b897eb7b151c622bd-0.parquet?uploadId=NDBhYjllZDEtNWIxOC00ZTBlLWI4ODYtOGRhZjBhNTg3NzQ5LjNlOGIyZmI4LWM4ZDEtNDU0ZS1iNjA0LWMxZjczNTI1NjhmZQ, headers={'User-Agent': b'Botocore/1.27.59 Python/3.10.10 Window
   s/10', 'X-Amz-Date': b'20230405T073854Z', 'X-Amz-Content-SHA256': b'316db9078636bc3acba7fc81ff32a5704c08a104bfaea7b5e15bf35db799e260', 'Authorization': b'AWS4-HMAC-SHA256 Credential=*****/*****/us-
   east-1/s3/aws4_request, SignedHeaders=host;x-amz-content-sha256;x-amz-date, Signature=0721f578ded50c01c1a64c05d62c628fb35f0e9385ffd3ecfa45423940995a63', 'amz-sdk-invocation-id': b'5b5bc340-7f6b-48cc-bf2a-0
   860f8fa859b', 'amz-sdk-request': b'attempt=1', 'Content-Length': '357'}>
   botocore.parsers - DEBUG - Response headers: HTTPHeaderDict({'accept-ranges': 'bytes', 'cache-control': 'no-cache', 'content-length': '479', 'content-security-policy': 'block-all-mixed-content', 'content-t
   ype': 'application/xml', 'etag': '"f44ab58edcc877c4d00075b9db28e4e5-3"', 'server': 'MinIO', 'strict-transport-security': 'max-age=31536000; includeSubDomains', 'vary': 'Accept-Encoding', 'x-accel-buffering
   ': 'no', 'x-amz-id-2': 'e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855', 'x-amz-request-id': '1752F9DC0E0CE8AD', 'x-content-type-options': 'nosniff', 'x-xss-protection': '1; mode=block', 
   'date': 'Wed, 05 Apr 2023 07:38:54 GMT'})
   botocore.parsers - DEBUG - Response body:
   b'<?xml version="1.0" encoding="UTF-8"?>\n<CompleteMultipartUploadResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/"><Location>http://localhost:9000/my-products/product=My%2520Fancy%2520Product/date=20
   23-01-10/a724b93c251a486b897eb7b151c622bd-0.parquet</Location><Bucket>my-products</Bucket><Key>product=My%20Fancy%20Product/date=2023-01-10/a724b93c251a486b897eb7b151c622bd-0.parquet</Key><ETag>&#34;f44ab5
   8edcc877c4d00075b9db28e4e5-3&#34;</ETag></CompleteMultipartUploadResult>'
   botocore.hooks - DEBUG - Event needs-retry.s3.CompleteMultipartUpload: calling handler <function check_for_200_error at 0x00000207A8422710>
   botocore.hooks - DEBUG - Event needs-retry.s3.CompleteMultipartUpload: calling handler <aiobotocore.retryhandler.AioRetryHandler object at 0x00000207ADA9EE30>
   botocore.retryhandler - DEBUG - No retry needed.
   botocore.hooks - DEBUG - Event needs-retry.s3.CompleteMultipartUpload: calling handler <bound method AioS3RegionRedirector.redirect_from_error of <aiobotocore.utils.AioS3RegionRedirector object at 0x000002
   07ADA9EEF0>>
   ```
   The difference in the debug output is the line starting with **botocore.parsers - DEBUG - Response body:**. In the XML part, the node `<Key></Key>` contains a URL-encoded string with pyarrow 11.0.0 vs. a "human readable" string with pyarrow 10.0.1. However, the string is not consistently URL encoded: as mentioned above, the equal sign `=` is interpreted (decoded) as expected, while the space is not.
   
   It seems that the URL encoding/decoding is not handled consistently.
   
   Wild guess of mine: this behavior might have been introduced by #33598 and/or #33468
   
   Thanks,
   Sven
   
   ### Component(s)
   
   Python




Re: [I] [Python] unexpected URL encoded path (white spaces) when uploading to S3 [arrow]

Posted by "sahitya-pavurala (via GitHub)" <gi...@apache.org>.
sahitya-pavurala commented on issue #34905:
URL: https://github.com/apache/arrow/issues/34905#issuecomment-1987544684

   take




Re: [I] [Python] unexpected URL encoded path (white spaces) when uploading to S3 [arrow]

Posted by "mitchelladam (via GitHub)" <gi...@apache.org>.
mitchelladam commented on issue #34905:
URL: https://github.com/apache/arrow/issues/34905#issuecomment-2047425289

   This is the case for GCS as well as S3.
   We just encountered this when updating from pyarrow 10.0.1 to 14.0.2, but it is present in all versions from 11.0.0 onwards.
   It occurs with both the gcsfs library and pyarrow.fs.GcsFileSystem.
   Example code:
   
   ```python
   import gcsfs
   import pyarrow as pa
   import pyarrow.fs as pafs
   import pyarrow.dataset as ds
   import datetime
   
   fs = gcsfs.GCSFileSystem()
   
   data = {
       "some_timestamp": [datetime.datetime.now(tz=datetime.timezone.utc) - datetime.timedelta(days=1),
                          datetime.datetime.now(tz=datetime.timezone.utc) - datetime.timedelta(days=2),
                          datetime.datetime.now(tz=datetime.timezone.utc) - datetime.timedelta(days=3)],
       "value1": ["hello", "world", "foo"],
       "value2": [123, 456, 789]
   }
   schema = pa.schema([
       pa.field("some_timestamp", pa.timestamp("ms")),
       pa.field("value1", pa.string()),
       pa.field("value2", pa.int64())
   ])
   
   result_pya_table = pa.Table.from_pydict(data, schema=schema)
   
   # fs = pafs.GcsFileSystem()
   ds.write_dataset(
       data=result_pya_table,
       base_dir="adam_ryota_data/pyarrowfstest/2023.12.2.post1-10.0.1/",
       format='parquet',
       partitioning=["some_timestamp"],
       partitioning_flavor='hive',
       existing_data_behavior='overwrite_or_ignore',
       basename_template="data-{i}.parquet",
       filesystem=fs
   )
   ```
   
   pyarrow 10.0.1 results in:
   ![image](https://github.com/apache/arrow/assets/17411882/6b4e2eb0-0166-47a9-bef7-7cc447e3927d)
   
   pyarrow 11.0.0 or higher results in:
   ![image](https://github.com/apache/arrow/assets/17411882/2ec27759-167d-4571-9a51-e1dce6972bda)
   
   
   Note that it is not the overall URI that gets encoded; only the partition values derived from the data within the dataset are affected. When the hive partition is supplied as part of the path instead:
   ```python
   data = {
       # "some_timestamp": [datetime.datetime.now(tz=datetime.timezone.utc) - datetime.timedelta(days=1),
       #                    datetime.datetime.now(tz=datetime.timezone.utc) - datetime.timedelta(days=2),
       #                    datetime.datetime.now(tz=datetime.timezone.utc) - datetime.timedelta(days=3)],
       "value1": ["hello", "world", "foo"],
       "value2": [123, 456, 789]
   }
   schema = pa.schema([
       # pa.field("some_timestamp", pa.timestamp("ms")),
       pa.field("value1", pa.string()),
       pa.field("value2", pa.int64())
   ])
   
   result_pya_table = pa.Table.from_pydict(data, schema=schema)
   
   # fs = pafs.GcsFileSystem()
   # some_timestamp=2024-04-07 11:13:27.169
   ds.write_dataset(
       data=result_pya_table,
       base_dir="adam_ryota_data/manualhive/2023.12.2.post1-10.0.1/some_timestamp=2024-04-07 11:13:27.169/",
       format='parquet',
       # partitioning=["some_timestamp"],
       # partitioning_flavor='hive',
       existing_data_behavior='overwrite_or_ignore',
       basename_template="data-{i}.parquet",
       filesystem=fs
   )
   ```
   
   Even in 11.0.0+, the data is written as expected.
   ![image](https://github.com/apache/arrow/assets/17411882/ecba744b-3808-42d6-893d-39609f9ef180)
   




Re: [I] [Python] unexpected URL encoded path (white spaces) when uploading to S3 [arrow]

Posted by "AlenkaF (via GitHub)" <gi...@apache.org>.
AlenkaF commented on issue #34905:
URL: https://github.com/apache/arrow/issues/34905#issuecomment-1812824575

   Hi @jainamshah102, might you still be interested in tackling this issue?




Re: [I] [Python] unexpected URL encoded path (white spaces) when uploading to S3 [arrow]

Posted by "sahitya-pavurala (via GitHub)" <gi...@apache.org>.
sahitya-pavurala commented on issue #34905:
URL: https://github.com/apache/arrow/issues/34905#issuecomment-1987534793

   Can I be assigned to this issue?




[GitHub] [arrow] westonpace commented on issue #34905: [Python] unexpected URL encoded path (white spaces) when uploading to S3

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on issue #34905:
URL: https://github.com/apache/arrow/issues/34905#issuecomment-1502040774

   This was introduced by the solution for https://github.com/apache/arrow/issues/33448.  It looks like we made a backwards incompatible change here which is unfortunate.
   
   > NOTICE: the Equal Sign = is URL encoded for the request, but won't become %3D on S3 filesystem. That means, the URL encoded equal sign = seems to be interpreted correctly
   
   I'm not sure it's relevant to my larger point, but I don't think the equal sign is actually encoded in the request:
   
   > \<Key>product=My%20Fancy%20Product/date=2023-01-10/a724b93c251a486b897eb7b151c622bd-0.parquet\</Key>
   
   Unfortunately, it is a tricky problem.  The encoding here is not there to support HTTP requests (in S3 all these paths go into the HTTP body and are not part of the URI) but rather to address two different problems.
   
   First, we need to support the concept of hive partitioning.  In hive partitioning there is a special meaning behind the `=` and `/` characters because `{x:3, y:7}` gets encoded as `x=3/y=7`.  This caused issues if the hive keys or hive values had `/` or `=` and so the solution was to encode the value (in retrospect I suppose we should be encoding the keys as well).
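   
   A minimal illustration of that mapping (added here for clarity, not part of the original comment), using pyarrow's dataset API:
   
   ```python
   import pyarrow as pa
   import pyarrow.dataset as ds
   
   # Hive partitioning maps field values to key=value path segments and back.
   part = ds.partitioning(pa.schema([("x", pa.int32()), ("y", pa.int32())]), flavor="hive")
   print(part.parse("/x=3/y=7"))  # roughly: ((x == 3) and (y == 7))
   ```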
   
   Second, most filesystems only support a limited set of characters.  Note that even S3 [doesn't fully support space](https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-keys.html):
   
   > Space – Significant sequences of spaces might be lost in some uses (especially multiple spaces)
   
   To solve this problem we are now using uriparser's RFC 3986 encode function.  This is an imprecise approach: it converts more characters than strictly needed in all cases.  However, there is some precedent for this (Spark), and I fear that anything narrower would be too complex and/or unintuitive.
   
   I'd support a PR to turn encoding on and off entirely (either as an argument to a partitioning object or part of the write_dataset options).  The default could be on and then users could choose to disable this feature.  Users are then responsible for ensuring their partitioning values consist of legal characters for their filesystem.




[GitHub] [arrow] westonpace commented on issue #34905: [Python] unexpected URL encoded path (white spaces) when uploading to S3

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on issue #34905:
URL: https://github.com/apache/arrow/issues/34905#issuecomment-1523744152

   I've labeled this `good-first-issue` in case anyone wants to take a look at it.  I'm happy to provide more context.  The steps we need would be:
   
    * Add a bool option to `HivePartitioningOptions` to disable / enable URI escaping (in src/arrow/dataset/partition.h)
    * Adjust the implementation to respect this option (in `HivePartitioning::FormatValues` in src/arrow/dataset/partition.cc)
    * Add unit tests (in partition_test.cc)
    * Add pyarrow bindings for this option (in the `HivePartitioning` class in dataset.pyx; just look for `null_fallback`, which is a similar option)
    * Add pyarrow unit tests to ensure the bindings are set up correctly (a rough sketch of what such a test might look like follows this list)
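   
   A rough, hypothetical sketch of such a pyarrow-level test (the keyword shown commented out is only a placeholder for the proposed option; it does not exist in released pyarrow):
   
   ```python
   import pyarrow as pa
   import pyarrow.dataset as ds
   
   
   def test_hive_partition_value_escaping_toggle(tmp_path):
       table = pa.table({"product": ["My Fancy Product"], "value": [1]})
       part = ds.partitioning(
           pa.schema([("product", pa.string())]),
           flavor="hive",
           # escape_paths=False,  # placeholder name for the proposed option to skip URI encoding
       )
       ds.write_dataset(table, str(tmp_path), format="parquet", partitioning=part)
       # Current (11.0.0+) behavior: the directory is "product=My%20Fancy%20Product".
       # With escaping disabled it should keep the literal space instead.
       assert any(p.name.startswith("product=") for p in tmp_path.iterdir())
   ```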
   
   




[GitHub] [arrow] DarthData410 commented on issue #34905: [Python] unexpected URL encoded path (white spaces) when uploading to S3

Posted by "DarthData410 (via GitHub)" <gi...@apache.org>.
DarthData410 commented on issue #34905:
URL: https://github.com/apache/arrow/issues/34905#issuecomment-1529692582

   @westonpace I can take this on and get a PR together for your review.




[GitHub] [arrow] svenatarms commented on issue #34905: [Python] unexpected URL encoded path (white spaces) when uploading to S3

Posted by "svenatarms (via GitHub)" <gi...@apache.org>.
svenatarms commented on issue #34905:
URL: https://github.com/apache/arrow/issues/34905#issuecomment-1511539440

   Thanks for looking into the issue and tracking down the cause. I like the idea of being able to turn encoding off for backwards compatibility. On our side, we'll change the behavior of our application to ensure that partitioning values no longer contain characters like whitespace.
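   
   For example, a minimal sketch of that kind of workaround (purely illustrative, not code from this thread) would be to normalize partition values in the DataFrame before handing it to pyarrow:
   
   ```python
   def sanitize_partition_values(df, partition_cols):
       """Replace whitespace in partition columns with underscores so the resulting
       hive directory names need no percent-encoding."""
       for col in partition_cols:
           df[col] = df[col].astype(str).str.replace(r"\s+", "_", regex=True)
       return df
   ```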




[GitHub] [arrow] DarthData410 commented on issue #34905: [Python] unexpected URL encoded path (white spaces) when uploading to S3

Posted by "DarthData410 (via GitHub)" <gi...@apache.org>.
DarthData410 commented on issue #34905:
URL: https://github.com/apache/arrow/issues/34905#issuecomment-1529933434

   I took a look at this a bit, and this is not a good fit for me to dive into right now. Maybe some other issue in the future. 




[GitHub] [arrow] jainamshah102 commented on issue #34905: [Python] unexpected URL encoded path (white spaces) when uploading to S3

Posted by "jainamshah102 (via GitHub)" <gi...@apache.org>.
jainamshah102 commented on issue #34905:
URL: https://github.com/apache/arrow/issues/34905#issuecomment-1627650449

   I am interested in working on this issue. Can you provide some guidance and assistance in resolving it?




[GitHub] [arrow] westonpace commented on issue #34905: [Python] unexpected URL encoded path (white spaces) when uploading to S3

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on issue #34905:
URL: https://github.com/apache/arrow/issues/34905#issuecomment-1629707890

   @jainamshah102 that's great.  You will first want to get a C++ development environment set up for Arrow and make sure you can build and run the tests (this is a complex task).  The [C++ development guide](https://arrow.apache.org/docs/dev/developers/cpp/index.html) should help.  In addition, you might want to look at the [first PR guide](https://arrow.apache.org/docs/developers/guide/step_by_step/index.html) if you have not made a PR for Arrow before.
   
   Once everything is building correctly you will want to create a unit test that reproduces this issue.  This would probably be in `cpp/src/arrow/dataset/partition_test.cc`.  Some general context:
   
   The class `arrow::dataset::Partitioning` is a pure virtual class (e.g. an interface) that turns paths into expressions and back.  For example, a directory partitioning could turn the path `/7/12` into the expression `x==7 && y == 12`.  The hive partitioning would turn that same expression into the path `/x=7/y=12` (hive partitioning is `key=value` style and directory partitioning omits the keys).
   
   This is done with two methods, `Format` and `Parse`.  The problem here is with the `HivePartitioning` class.  Currently, in `Format`, we URL encode the path.  Then, in `Parse`, we URL decode the path.  The ask is to add a new option to `HivePartitioning` (perhaps named `escape_paths`) which, if set to true, will use the current behavior and, if set to false, will skip the URL encoding/decoding.
   
   Let me know if you run into more problems.
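   
   For readers following along from Python, a minimal local sketch of the two directions described above (behavior as expected for pyarrow >= 11; no S3 needed):
   
   ```python
   import pathlib
   import pyarrow as pa
   import pyarrow.dataset as ds
   
   # Format: writing a dataset with a hive partition value containing a space
   # produces a percent-encoded directory name on disk.
   table = pa.table({"product": ["My Fancy Product"], "value": [1]})
   ds.write_dataset(table, "repro_dataset", format="parquet",
                    partitioning=["product"], partitioning_flavor="hive")
   print([p.name for p in pathlib.Path("repro_dataset").iterdir()])
   # expected with 11.0.0+: ['product=My%20Fancy%20Product']
   
   # Parse: the encoded segment is decoded back into the original value.
   part = ds.partitioning(pa.schema([("product", pa.string())]), flavor="hive")
   print(part.parse("/product=My%20Fancy%20Product"))
   # roughly: (product == "My Fancy Product")
   ```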

