Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/04/26 11:23:42 UTC

[GitHub] [arrow-rs] alamb commented on issue #40: Add parquet test file for all supported types in 2.5.0 format

alamb commented on issue #40:
URL: https://github.com/apache/arrow-rs/issues/40#issuecomment-826756956


   Comment from Dr. Christoph Jung(doc_schorsch) @ 2021-04-24T07:08:11.717+0000:
   <pre>I'm willing to volunteer for this one.
   
   [https://github.com/drcgjung]
   {quote}Being generally interested in contributing to the datafusion/ballista development stream.
    ~5 years of professional experience in Apache Spark (RDD & dataframe) for large-scale measurement data.
    ~4 years of open source contributions to JBoss (aka "Wildfly") a while back.
   {quote}
   Some obligatory questions:
    * Java API = [https://github.com/apache/parquet-mr] ?
   
    * Parquet Format = [https://github.com/apache/parquet-format] ?
   
    * Is it sufficient to restrict to the 2.6.0 format (Java API >= 1.11)? There really was no Java API release using 2.5.0,
    and [https://github.com/apache/arrow/blob/master/rust/parquet/README.md] refers to 2.6.0.
   
    * "arrow-testing" = [https://github.com/apache/arrow-testing] ?
    * New folder there, data/parquet/types ?
    * Where to put the Java generator project, also there?
   
    * Better to have a single parquet with all the types or one parquet per basic type (can be many derived ones, see below)?
   
    * Would it be good to include the format version into the test parquet file name (for later additions when rust/parquet upgrades the format)?
   
    * I count 14 "plain" logical, parameterized types.
   
    * I count 29 relevant basic type instantiations; each could be represented as mandatory and as optional (=> 58 test types)
    ** string
    ** enum
    ** uuid
    ** int_8, ... uint_64
    ** decimal_32, decimal_64 (maybe additional precision tests?)
    ** date
    ** time_utc_millis, time_utc_micros, time_utc_nanos, time_local_millis, time_local_micros, time_local_nanos
    ** timestamp_utc_millis, ... timestamp_local_nanos
    ** interval
    ** json, bson
   
    * Nested types could be derived in arbitrary combinations, but I guess it's OK to have
    one LIST and two MAP types per basic test type (one using it as the required key and one as the value). Again,
    each nested type could be mandatory or optional. (=> 58*2 + 29*2 + 58*2 = 290 nested test types)
   
    * There will be two PRs necessary because of the two repositories involved (arrow hard-linking to a version in the arrow-testing repo). The
    arrow PR will have to change the version link to the arrow-testing repo (which is maybe not safe for other arrow subprojects). Is that ok?
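
As a rough sanity check on the counts above, here is a minimal Python sketch that enumerates the proposed test columns. The label strings are only illustrative: the expansions of "int_8, ... uint_64" and "timestamp_utc_millis, ... timestamp_local_nanos" are assumed to mirror the spelled-out entries, and the function names are hypothetical. This is not the Java generator itself, just the arithmetic.

```python
# Assumed expansion of "int_8, ... uint_64": signed/unsigned at 8/16/32/64 bits.
INTS = [f"{sign}int_{bits}" for sign in ("", "u") for bits in (8, 16, 32, 64)]

# time_* is listed in full above; timestamp_* is assumed to mirror it.
TIME_LIKE = [
    f"{kind}_{adjustment}_{unit}"
    for kind in ("time", "timestamp")
    for adjustment in ("utc", "local")
    for unit in ("millis", "micros", "nanos")
]

BASIC = [
    "string", "enum", "uuid",
    *INTS,
    "decimal_32", "decimal_64",
    "date",
    *TIME_LIKE,
    "interval", "json", "bson",
]

REPETITIONS = ("required", "optional")


def flat_columns():
    """Every basic type, once as mandatory and once as optional."""
    return [(rep, t) for t in BASIC for rep in REPETITIONS]


def nested_columns():
    """One LIST and two MAP variants per test type, each mandatory or optional."""
    cols = []
    for outer in REPETITIONS:
        # LIST whose element is any of the flat test types.
        cols += [(outer, "list", rep, t) for rep, t in flat_columns()]
        # MAP using the basic type as key (keys are always required).
        cols += [(outer, "map_key", "required", t) for t in BASIC]
        # MAP using any of the flat test types as value.
        cols += [(outer, "map_value", rep, t) for rep, t in flat_columns()]
    return cols


if __name__ == "__main__":
    print(len(BASIC), len(flat_columns()), len(nested_columns()))
```

Under these assumptions the three totals come out as 29, 58 and 290, matching the figures in the list.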
   
   Thanks if/for considering me ;)
    </pre>

