
Posted to issues@arrow.apache.org by "johnseekins (via GitHub)" <gi...@apache.org> on 2023/03/13 15:06:06 UTC

[GitHub] [arrow] johnseekins opened a new issue, #34546: Datasets with partitions create a list instead of folders

johnseekins opened a new issue, #34546:
URL: https://github.com/apache/arrow/issues/34546

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   I've been trying to create (and then write) a dataset using pyarrow:
   
   ```
   import pyarrow
   import pyarrow.dataset as ds

   # `bucket` is assumed to hold the name of the GCS bucket.
   file_options = ds.ParquetFileFormat().make_write_options(compression="zstd")
   partitions = ds.partitioning(
       schema=pyarrow.schema([
           ("partition1", pyarrow.large_string()),
           ("partition2", pyarrow.large_string()),
           ("3", pyarrow.large_string()),
       ])
   )
   # Read the existing dataset, then write it back with the same partitioning.
   dataset = ds.dataset(
       f"gs://{bucket}/path1/",
       partitioning=partitions,
   )
   table = dataset.scanner().to_table()
   ds.write_dataset(
       table,
       f"gs://{bucket}/path1/",
       format="parquet",
       file_options=file_options,
       partitioning=partitions,
       schema=table.schema,
       existing_data_behavior="overwrite_or_ignore",
   )
   ```
   
   This code never seems to turn the partition values into the expected directory names. Instead, the write fails with:
   
   ```
   pyarrow.lib.ArrowInvalid: google::cloud::Status(INVALID_ARGUMENT: Permanent error CreateResumableUpload: Disallowed unicode characters present in object name 'curation/bills/[
     "partition1"
   ]/[
     "partition2"
   ]/[
     "3"
   ]/part-0.par...' error_info={reason=invalid, domain=global, metadata={http_status_code=400}}). Detail: [errno 22] Invalid argument
   ```
   
   Am I doing something wrong here? I _thought_ I was following the documentation...
   
   ### Component(s)
   
   Parquet, Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] wjones127 commented on issue #34546: Datasets with partitions create a list instead of folders

Posted by "wjones127 (via GitHub)" <gi...@apache.org>.

wjones127 commented on issue #34546:
URL: https://github.com/apache/arrow/issues/34546#issuecomment-1466570432

   Polars uses `large_string` instead of `string` only to avoid the limitations of the 32-bit offsets in `string`. Basically, a `string` array can't hold more than 2 GB of string data, so larger string columns must either be chunked or use `large_string`, which has 64-bit offsets. But we should support large strings better in datasets!
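
The 2 GB figure follows from the offset width: with signed 32-bit offsets, the last offset in an array (the total number of string bytes) cannot exceed 2**31 - 1. A minimal sketch of that arithmetic in plain Python is below. The helper name `fits_in_string_array` is hypothetical and is not part of pyarrow or Polars; it only illustrates the limit described in the comment above.

```python
# Largest value a signed 32-bit offset can hold, in bytes (just under 2 GiB).
INT32_MAX = 2**31 - 1


def fits_in_string_array(values):
    """Return True if the UTF-8 bytes of `values` fit under 32-bit offsets.

    Hypothetical helper for illustration only; None entries count as 0 bytes.
    """
    total_bytes = sum(len(v.encode("utf-8")) for v in values if v is not None)
    return total_bytes <= INT32_MAX


# Small partition values like the ones in this issue fit comfortably.
print(fits_in_string_array(["2001", "2002", "2003"]))
```

The limit applies per array, which is why chunking a column is the alternative to 64-bit offsets.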



[GitHub] [arrow] johnseekins closed issue #34546: Datasets with partitions create a list instead of folders

Posted by "johnseekins (via GitHub)" <gi...@apache.org>.

johnseekins closed issue #34546: Datasets with partitions create a list instead of folders
URL: https://github.com/apache/arrow/issues/34546



[GitHub] [arrow] johnseekins commented on issue #34546: Datasets with partitions create a list instead of folders

Posted by "johnseekins (via GitHub)" <gi...@apache.org>.

johnseekins commented on issue #34546:
URL: https://github.com/apache/arrow/issues/34546#issuecomment-1466528724

   > Hi John, have you looked at the values in your `table`? Do any of them contain the value `[\n"partition1"\n]`?
   > 
   > I tried to reproduce this (just locally, not GCS yet), but was not able to (writing the dataset worked fine):
   > 
   > ```python
   > import pyarrow as pa
   > import pyarrow.dataset as ds
   > 
   > table = pa.table(
   >     {
   >         "partition1": ["1", "2", "3", "4", "5"],
   >         "partition2": ["2001", "2002", "2003", "2004", "2004"],
   >         "3": ["1", "2", "3", "4", "5"],
   >         "val": list(range(5)),
   >     }
   > )
   > file_options = ds.ParquetFileFormat().make_write_options(compression="zstd")
   > partitions = ds.partitioning(
   >     schema=pa.schema(
   >         [
   >             ("partition1", pa.large_string()),
   >             ("partition2", pa.large_string()),
   >             ("3", pa.large_string()),
   >         ]
   >     )
   > )
   > ds.write_dataset(
   >     table,
   >     f"tables/test_parts",
   >     format="parquet",
   >     file_options=file_options,
   >     partitioning=partitions,
   >     schema=table.schema,
   >     existing_data_behavior="overwrite_or_ignore",
   > )
   > ```
   
   Running this code _also_ fails for me:
   
   ```
   $ cat test.py
   import pyarrow as pa
   import pyarrow.dataset as ds
   
   table = pa.table(
       {
           "partition1": ["1", "2", "3", "4", "5"],
           "partition2": ["2001", "2002", "2003", "2004", "2004"],
           "3": ["1", "2", "3", "4", "5"],
           "val": list(range(5)),
       }
   )
   file_options = ds.ParquetFileFormat().make_write_options(compression="zstd")
   partitions = ds.partitioning(
       schema=pa.schema(
           [
               ("partition1", pa.large_string()),
               ("partition2", pa.large_string()),
               ("3", pa.large_string()),
           ]
       )
   )
   ds.write_dataset(
       table,
       f"tables/test_parts",
       format="parquet",
       file_options=file_options,
       partitioning=partitions,
       schema=table.schema,
       existing_data_behavior="overwrite_or_ignore",
   )
   $ poetry run python test.py
   $ ls tables/test_parts/
   '['$'\n''  "1"'$'\n'']'  '['$'\n''  "2"'$'\n'']'  '['$'\n''  "3"'$'\n'']'  '['$'\n''  "4"'$'\n'']'  '['$'\n''  "5"'$'\n'']'
   ```



[GitHub] [arrow] wjones127 commented on issue #34546: Datasets with partitions create a list instead of folders

Posted by "wjones127 (via GitHub)" <gi...@apache.org>.

wjones127 commented on issue #34546:
URL: https://github.com/apache/arrow/issues/34546#issuecomment-1466538080

   Oh wait I'm actually seeing weird directories too. :confused:
   
   ```
   $ exa  --tree test_parts
   test_parts
   ├── [\n  "1"\n]
   │  └── [\n  "2001"\n]
   │     └── [\n  "1"\n]
   │        └── part-0.parquet
   ├── [\n  "2"\n]
   │  └── [\n  "2002"\n]
   │     └── [\n  "2"\n]
   │        └── part-0.parquet
   ├── [\n  "3"\n]
   │  └── [\n  "2003"\n]
   │     └── [\n  "3"\n]
   │        └── part-0.parquet
   ├── [\n  "4"\n]
   │  └── [\n  "2004"\n]
   │     └── [\n  "4"\n]
   │        └── part-0.parquet
   └── [\n  "5"\n]
      └── [\n  "2004"\n]
         └── [\n  "5"\n]
            └── part-0.parquet
   ```
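
A quick way to check an existing local dataset for directories like these is to walk the tree and flag names that contain a newline or a square bracket. This is only a sketch using the standard library; the function name `find_malformed_partition_dirs` is made up for illustration, and the path `tables/test_parts` is the one used in the reproduction script from this thread.

```python
import os


def find_malformed_partition_dirs(root):
    """Return directory paths under `root` whose names contain a newline or bracket."""
    bad = []
    for dirpath, dirnames, _filenames in os.walk(root):
        for name in dirnames:
            if "\n" in name or "[" in name or "]" in name:
                bad.append(os.path.join(dirpath, name))
    return sorted(bad)


# Prints an empty list if the directory does not exist or looks healthy.
print(find_malformed_partition_dirs("tables/test_parts"))
```

Note that the check would also flag legitimate partition values that happen to contain brackets, so treat a hit as a prompt to look closer rather than as proof of the bug.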



[GitHub] [arrow] johnseekins commented on issue #34546: Datasets with partitions create a list instead of folders

Posted by "johnseekins (via GitHub)" <gi...@apache.org>.

johnseekins commented on issue #34546:
URL: https://github.com/apache/arrow/issues/34546#issuecomment-1466575220

   > Polars uses `large_string` instead of `string` only to avoid the limitations of the 32-bit offsets in `string`. Basically, a `string` array can't hold more than 2 GB of string data, so larger string columns must either be chunked or use `large_string`, which has 64-bit offsets.
   
   I had a feeling that would be the reason why. Thanks for that clarification.
   
   While y'all work out what to do about `large_string`...is there another way around this problem?



[GitHub] [arrow] wjones127 commented on issue #34546: Datasets with partitions create a list instead of folders

Posted by "wjones127 (via GitHub)" <gi...@apache.org>.

wjones127 commented on issue #34546:
URL: https://github.com/apache/arrow/issues/34546#issuecomment-1466602699

   I'm going to re-open this to make sure we fix the serialization of `large_string` partitioning. But good to hear that the workaround works for you, @johnseekins.



[GitHub] [arrow] wjones127 closed issue #34546: [C++] Hive partitioning values contain newlines for large_string

Posted by "wjones127 (via GitHub)" <gi...@apache.org>.

wjones127 closed issue #34546: [C++] Hive partitioning values contain newlines for large_string
URL: https://github.com/apache/arrow/issues/34546



[GitHub] [arrow] wjones127 commented on issue #34546: Datasets with partitions create a list instead of folders

Posted by "wjones127 (via GitHub)" <gi...@apache.org>.

wjones127 commented on issue #34546:
URL: https://github.com/apache/arrow/issues/34546#issuecomment-1466458699

   Hi John, have you looked at the values in your `table`? Do any of them contain the value `[\n"partition1"\n]`?
   
   I tried to reproduce this (just locally, not GCS yet), but was not able to (writing the dataset worked fine):
   
   ```python
   import pyarrow as pa
   import pyarrow.dataset as ds
   
   table = pa.table(
       {
           "partition1": ["1", "2", "3", "4", "5"],
           "partition2": ["2001", "2002", "2003", "2004", "2004"],
           "3": ["1", "2", "3", "4", "5"],
           "val": list(range(5)),
       }
   )
   file_options = ds.ParquetFileFormat().make_write_options(compression="zstd")
   partitions = ds.partitioning(
       schema=pa.schema(
           [
               ("partition1", pa.large_string()),
               ("partition2", pa.large_string()),
               ("3", pa.large_string()),
           ]
       )
   )
   ds.write_dataset(
       table,
       f"tables/test_parts",
       format="parquet",
       file_options=file_options,
       partitioning=partitions,
       schema=table.schema,
       existing_data_behavior="overwrite_or_ignore",
   )
   ```



[GitHub] [arrow] wjones127 commented on issue #34546: Datasets with partitions create a list instead of folders

Posted by "wjones127 (via GitHub)" <gi...@apache.org>.

wjones127 commented on issue #34546:
URL: https://github.com/apache/arrow/issues/34546#issuecomment-1466587443

   I think even if the arrays are `large_string`, you can still specify `string` in the partitioning. This seems to work:
   
   ```python
   import pyarrow as pa
   import pyarrow.dataset as ds
   
   table = pa.table(
       {
           "partition1": pa.array(["1", "2", "3", "4", "5"], pa.large_string()),
           "partition2": pa.array(["2001", "2002", "2003", "2004", "2004"], pa.large_string()),
           "3": pa.array(["1", "2", "3", "4", "5"], pa.large_string()),
           "val": list(range(5)),
       }
   )
   
   file_options = ds.ParquetFileFormat().make_write_options(compression="zstd")
   partitions = ds.partitioning(
       schema=pa.schema(
           [
               ("partition1", pa.string()),
               ("partition2", pa.string()),
               ("3", pa.string()),
           ]
       )
   )
   
   ds.write_dataset(
       table,
       "tables/test_parts",
       format="parquet",
       file_options=file_options,
       partitioning=partitions,
       schema=table.schema,
       existing_data_behavior="overwrite_or_ignore",
   )
   ```



[GitHub] [arrow] johnseekins commented on issue #34546: Datasets with partitions create a list instead of folders

Posted by "johnseekins (via GitHub)" <gi...@apache.org>.

johnseekins commented on issue #34546:
URL: https://github.com/apache/arrow/issues/34546#issuecomment-1466547726

   Interesting. If I'm honest, I'm not sure why the columns are `large_string`. I used polars a bit in writing the files, and it seems to be creating all columns as `large_string` instead of `string`.



[GitHub] [arrow] wjones127 commented on issue #34546: Datasets with partitions create a list instead of folders

Posted by "wjones127 (via GitHub)" <gi...@apache.org>.

wjones127 commented on issue #34546:
URL: https://github.com/apache/arrow/issues/34546#issuecomment-1466542411

   This seems to be specific to using `pa.large_string()` in the partitioning scheme. If we use `pa.string()` instead, it works normally.



[GitHub] [arrow] johnseekins commented on issue #34546: Datasets with partitions create a list instead of folders

Posted by "johnseekins (via GitHub)" <gi...@apache.org>.

johnseekins commented on issue #34546:
URL: https://github.com/apache/arrow/issues/34546#issuecomment-1466513872

   That was a bad copy/paste on my part. The values do not match the column names. I've updated the original example to reflect that.



[GitHub] [arrow] johnseekins commented on issue #34546: Datasets with partitions create a list instead of folders

Posted by "johnseekins (via GitHub)" <gi...@apache.org>.

johnseekins commented on issue #34546:
URL: https://github.com/apache/arrow/issues/34546#issuecomment-1466525868

   Interestingly, I _also_ can't reproduce this locally. So this seems to be purely a problem with uploading the dataset to GCS.



[GitHub] [arrow] johnseekins commented on issue #34546: Datasets with partitions create a list instead of folders

Posted by "johnseekins (via GitHub)" <gi...@apache.org>.

johnseekins commented on issue #34546:
URL: https://github.com/apache/arrow/issues/34546#issuecomment-1466599618

   That solved my problem. I'm genuinely not sure why I didn't try that sooner. Thanks, @wjones127!

