Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2023/01/17 16:02:42 UTC

[GitHub] [beam] albertvillanova opened a new issue, #25041: [Bug]: Some written Parquet shard files have 0 num_rows

albertvillanova opened a new issue, #25041:
URL: https://github.com/apache/beam/issues/25041

   ### What happened?
   
   After the 2.44.0 release, we have found an issue when writing to Parquet using shards: some files have 0 rows.
   
   Steps:
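   ```python
   import apache_beam as beam
   import pyarrow
   import pyarrow.parquet  # needed for pyarrow.parquet.ParquetFile below

   # Write two records to Parquet, requesting two output shards.
   with beam.Pipeline() as p:
       records = p | 'Read' >> beam.Create(
           [{'name': 'foo', 'age': 10}, {'name': 'bar', 'age': 20}]
       )
       _ = records | 'Write' >> beam.io.WriteToParquet("filename",
           pyarrow.schema(
               [('name', pyarrow.binary()), ('age', pyarrow.int64())]
           ), num_shards=2
       )
   
   # Print the Parquet metadata (including num_rows) of each shard file.
   for filename in ["filename-00000-of-00002", "filename-00001-of-00002"]:
       parquet_file = pyarrow.parquet.ParquetFile(filename)
       print(filename)
       print(parquet_file.metadata)
       print()
   ```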
   
   One of the files has 0 rows:
   ```
   filename-00000-of-00002
   <pyarrow._parquet.FileMetaData object at 0x7f42d2362810>
     created_by: parquet-cpp-arrow version 9.0.0
     num_columns: 2
     num_rows: 2
     num_row_groups: 1
     format_version: 2.6
     serialized_size: 514
   
   filename-00001-of-00002
   <pyarrow._parquet.FileMetaData object at 0x7f42d2063680>
     created_by: parquet-cpp-arrow version 9.0.0
     num_columns: 2
     num_rows: 0
     num_row_groups: 0
     format_version: 2.6
     serialized_size: 340
   ```
   
   Before (in version 2.43.0), none of the files had 0 rows:
   ```
   filename-00000-of-00002
   <pyarrow._parquet.FileMetaData object at 0x7f673a4dcb30>
     created_by: parquet-cpp-arrow version 9.0.0
     num_columns: 2
     num_rows: 1
     num_row_groups: 1
     format_version: 2.6
     serialized_size: 512
   
   filename-00001-of-00002
   <pyarrow._parquet.FileMetaData object at 0x7f6738cf3950>
     created_by: parquet-cpp-arrow version 9.0.0
     num_columns: 2
     num_rows: 1
     num_row_groups: 1
     format_version: 2.6
     serialized_size: 512
   ```
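
The per-file inspection above could be automated with a small helper. This is only a sketch: `parquet_num_rows` and `find_empty_shards` are hypothetical names and not part of Beam or PyArrow, and the glob pattern assumes the shard naming shown in the reproduction. It relies on `pyarrow.parquet.ParquetFile(...).metadata.num_rows`, the same metadata printed above.

```python
import glob
from typing import Callable, Iterable, List


def parquet_num_rows(path: str) -> int:
    """Return the row count stored in a Parquet file's footer metadata."""
    # Imported lazily so find_empty_shards can be exercised without pyarrow.
    import pyarrow.parquet as pq

    return pq.ParquetFile(path).metadata.num_rows


def find_empty_shards(
    paths: Iterable[str],
    num_rows: Callable[[str], int] = parquet_num_rows,
) -> List[str]:
    """Return, in sorted order, the paths whose reported row count is 0."""
    return sorted(path for path in paths if num_rows(path) == 0)


if __name__ == "__main__":
    # Assumption: shards are named as in the reproduction above.
    for path in find_empty_shards(glob.glob("filename-*-of-*")):
        print(f"empty shard: {path}")
```

Run against the 2.44.0 output listed above, this would be expected to report `filename-00001-of-00002`; against the 2.43.0 output it would report nothing.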
   
   ### Issue Priority
   
   Priority: 2 (default / most bugs should be filed as P2)
   
   ### Issue Components
   
   - [X] Component: Python SDK
   - [ ] Component: Java SDK
   - [ ] Component: Go SDK
   - [ ] Component: Typescript SDK
   - [ ] Component: IO connector
   - [ ] Component: Beam examples
   - [ ] Component: Beam playground
   - [ ] Component: Beam katas
   - [ ] Component: Website
   - [ ] Component: Spark Runner
   - [ ] Component: Flink Runner
   - [ ] Component: Samza Runner
   - [ ] Component: Twister2 Runner
   - [ ] Component: Hazelcast Jet Runner
   - [ ] Component: Google Cloud Dataflow Runner


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] albertvillanova commented on issue #25041: [Bug]: Some written Parquet shard files have 0 num_rows

Posted by "albertvillanova (via GitHub)" <gi...@apache.org>.
albertvillanova commented on issue #25041:
URL: https://github.com/apache/beam/issues/25041#issuecomment-1545342017

   The same error persists in apache-beam versions:
   - 2.44.0
   - 2.45.0
   - 2.46.0
   - 2.47.0



[GitHub] [beam] johnjcasey commented on issue #25041: [Bug]: Some written Parquet shard files have 0 num_rows

Posted by "johnjcasey (via GitHub)" <gi...@apache.org>.
johnjcasey commented on issue #25041:
URL: https://github.com/apache/beam/issues/25041#issuecomment-1570805744

   We should look into why this changed, but I'm not sure this is necessarily incorrect behavior.

