Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2022/04/12 15:58:46 UTC

[GitHub] [beam] lostluck commented on pull request #17347: implement parquetio to read/write parquet files

lostluck commented on PR #17347:
URL: https://github.com/apache/beam/pull/17347#issuecomment-1096911268

   > A few changes (mostly the Apache licensing) before this looks good. I'd also suggest looking at what it would take to write a Splittable DoFn (https://beam.apache.org/documentation/programming-guide/#splittable-dofns) version of this so the reads could scale.
   
   It looks like that parquet package could easily support record-level reads for sub-file splitting too. The file metadata includes the [number of rows](https://pkg.go.dev/github.com/xitongsys/parquet-go/reader#ParquetReader.GetNumRows), and the reader can also [skip rows](https://pkg.go.dev/github.com/xitongsys/parquet-go/reader#ParquetReader.SkipRows), so each split could skip to its starting row and read only its own share.
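
A rough sketch of the row-range arithmetic such a split might use, not the PR's implementation. `rowRange` and `splitRows` are hypothetical names introduced here for illustration. The sketch only uses the standard library; in a real reader the row count would presumably come from `GetNumRows`, and each worker would call `SkipRows` with its range start before reading.

```go
package main

import "fmt"

// rowRange is a hypothetical restriction covering rows [Start, End) of one file.
type rowRange struct {
	Start, End int64
}

// splitRows divides numRows rows into at most `parts` contiguous,
// non-overlapping ranges of near-equal size.
func splitRows(numRows, parts int64) []rowRange {
	if numRows <= 0 || parts <= 0 {
		return nil
	}
	if parts > numRows {
		parts = numRows // never produce empty ranges
	}
	size, extra := numRows/parts, numRows%parts
	ranges := make([]rowRange, 0, parts)
	var start int64
	for i := int64(0); i < parts; i++ {
		end := start + size
		if i < extra {
			end++ // spread the remainder over the first ranges
		}
		ranges = append(ranges, rowRange{Start: start, End: end})
		start = end
	}
	return ranges
}

func main() {
	// Assumption: with parquet-go, numRows would come from
	// ParquetReader.GetNumRows(), and a worker would call
	// SkipRows(r.Start) before reading r.End-r.Start rows.
	for _, r := range splitRows(10, 3) {
		fmt.Printf("skip %d rows, then read %d rows\n", r.Start, r.End-r.Start)
	}
}
```

For 10 rows and 3 parts this yields the ranges [0, 4), [4, 7) and [7, 10). A Splittable DoFn would track a range like this as its restriction rather than precomputing every split up front.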

