Posted to dev@parquet.apache.org by GitBox <gi...@apache.org> on 2021/01/04 22:41:18 UTC

[GitHub] [parquet-mr] satishkotha commented on pull request #847: [PARQUET-1951] Allow different strategies to combine key values when …

satishkotha commented on pull request #847:
URL: https://github.com/apache/parquet-mr/pull/847#issuecomment-754266730


   > I have a couple of comments but the change itself looks good.
   > 
   > Meanwhile, I have some problems with the concept. `parquet-tools` is a simple command in most of the environments. It means that it is not clear how to extend it with your own implementation.
   > I would suggest adding this change to the API only (in `parquet-hadoop`) so you can extend the implementation from the code. If you want to extend `parquet-tools` as well I would suggest adding your implementation to parquet and let the user choose from the existing implementations (by its simple name like `strict` or others not the class name).
   
   @gszadovszky I added one more strategy for merging key values. I would prefer to keep this in parquet-tools so that it stays easy to run for one-off debugging use cases. Let me know if you have a strong opinion against this.
   
   Also, the actual merge strategy I need is specific to my use case, so I'd like to keep referencing it by class name. Here is an example to help explain:
   
   This is how I run parquet-tools:
   java -cp  my-app-jar:parquet-tools/target/parquet-tools-1.12.0-SNAPSHOT.jar:<hadoop-jars> merge --s <myClass> f1.parquet f2.parquet merged.parquet
   
   The Parquet files we write carry two additional metadata keys, "range" and "count" ("count" is not the row count; it is specific to my use case and tracks distinct values for a column).
   
   1) file f1.parquet metadata will have:
   - key: "range", value: "a-d"
   - key: "count", value: "12000"
   
   2) file f2.parquet metadata will have:
   - key: "range", value: "e-f"
   - key: "count", value: "10000"
   
   When we merge, we want the resulting parquet file to have this metadata:
   - key: "range", value: "a-f"
   - key: "count", value: "15000"
   
   So the merge strategy differs for each key we store in the parquet footer. This is specific to how we use parquet and very likely not usable for other use cases, so I prefer not to expose this strategy in the parquet library.
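
To make the per-key idea concrete, here is a minimal sketch in plain Java. It is not the interface from this PR: the class `PerKeyMetadataMerger`, its `merge` signature, and the `mergeRanges` helper are hypothetical names used only for illustration, and it assumes range values always look like "start-end". How the merged "count" of 15000 is derived is not covered here, so only the "range" combiner is shown; a "count" combiner would be registered the same way.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.function.BinaryOperator;

/** Hypothetical helper (not part of parquet-mr): merges footer key values with a per-key rule. */
class PerKeyMetadataMerger {
    // Key name -> function that combines two values for that key.
    private final Map<String, BinaryOperator<String>> combiners;

    PerKeyMetadataMerger(Map<String, BinaryOperator<String>> combiners) {
        this.combiners = combiners;
    }

    /** Combines two "start-end" ranges into the smallest range covering both. */
    static String mergeRanges(String left, String right) {
        String[] l = left.split("-", 2);
        String[] r = right.split("-", 2);
        if (l.length != 2 || r.length != 2) {
            throw new IllegalArgumentException("Expected ranges of the form start-end");
        }
        String start = l[0].compareTo(r[0]) <= 0 ? l[0] : r[0];
        String end = l[1].compareTo(r[1]) >= 0 ? l[1] : r[1];
        return start + "-" + end;
    }

    /** Merges the metadata maps of several files, one map per input file. */
    Map<String, String> merge(List<Map<String, String>> perFileMetadata) {
        Map<String, String> merged = new HashMap<>();
        for (Map<String, String> fileMetadata : perFileMetadata) {
            for (Map.Entry<String, String> entry : fileMetadata.entrySet()) {
                String key = entry.getKey();
                String value = entry.getValue();
                if (!merged.containsKey(key)) {
                    merged.put(key, value);
                    continue;
                }
                String existing = merged.get(key);
                BinaryOperator<String> combiner = combiners.get(key);
                if (combiner != null) {
                    merged.put(key, combiner.apply(existing, value));
                } else if (!Objects.equals(existing, value)) {
                    // Keys without a registered rule must agree across files.
                    throw new IllegalStateException("Conflicting values for key: " + key);
                }
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, BinaryOperator<String>> rules = new HashMap<>();
        rules.put("range", PerKeyMetadataMerger::mergeRanges);
        PerKeyMetadataMerger merger = new PerKeyMetadataMerger(rules);
        Map<String, String> result =
            merger.merge(List.of(Map.of("range", "a-d"), Map.of("range", "e-f")));
        System.out.println(result.get("range")); // prints "a-f"
    }
}
```

Keys without a registered combiner fall back to a strict check in this sketch: identical values pass through and differing values raise an error.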
   
   Let me know if you have any suggestions.
   

