Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/08/31 03:23:51 UTC

[GitHub] [arrow] westonpace commented on issue #11027: PyArrow Parquet column partitioning

westonpace commented on issue #11027:
URL: https://github.com/apache/arrow/issues/11027#issuecomment-908867010

I think ARROW-12644 fixes something different. My gut reaction would be to not do this. It seems reasonable to expect that partition columns only contain filesystem-safe paths.

Spark URL-encodes unsafe characters (I'm not sure whether it does this in all cases or only when a timestamp is used as a partition column). ARROW-12644 made sure we could read such paths but, as discussed in the JIRA, it isn't clear that we should support writing them.

`/'date=2021/08/30'/somedata.parquet` is not going to be a safe path on all filesystems, so I don't think that is a viable alternative. If we were to URL encode paths, you would get `2021%2F08%2F30`, which is an odd thing to have in the filesystem, but it should at least work. Perhaps we need a URL encoding kernel; then you could partition on that column projected with URL encoding (although I don't think projection support is quite there yet).
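
To illustrate what URL encoding the partition value would produce, here is a rough sketch using only Python's standard library. It is not an Arrow kernel, and the helper name `encode_partition_value` is made up for this example:

```python
from urllib.parse import quote, unquote


def encode_partition_value(value: str) -> str:
    """Percent-encode a partition value so it fits in one path segment.

    Hypothetical helper for illustration only; not part of pyarrow.
    """
    # quote() leaves "/" unescaped by default, so pass safe="" to encode it too.
    return quote(value, safe="")


encoded = encode_partition_value("2021/08/30")
segment = f"date={encoded}"   # a single directory name with no "/" in it
decoded = unquote(encoded)    # a reader can recover the original value

print(segment)
print(decoded)
```

The encoded segment contains no path separator, so it maps to a single directory, and decoding it restores the original value.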

Today, as a possible workaround, you could use pyarrow compute to do a replace on `/` with the character of your choosing: https://arrow.apache.org/docs/cpp/compute.html#string-transforms
