Posted to dev@parquet.apache.org by "Miller, Tim" <th...@amazon.com.INVALID> on 2022/05/27 17:49:41 UTC

Spin off CLI into separate project?

I just wanted to bounce an idea off everyone. I've noticed that certain bugs show up when reading files with parquet-cli but not when reading the same files through the SDK in a Java program. There appears to be some code duplicated between the CLI and the rest of the SDK. One example I noticed: changes were made to detect UUIDs as a special case of a fixed-length byte array, and while all the necessary changes were made in the SDK, some are missing from the duplicated code in the CLI.

What we need to do is stop relying on the duplicated code in the CLI and have the CLI exist ONLY as a thin wrapper around the SDK. One way to force ourselves to do that might be to maintain the CLI as a separate project. Of course, I haven't yet figured out what all of these code inconsistencies are, so it may turn out to be easy to just fix the CLI as it is. The point, though, is to adopt policies that make it harder to break some parts of ParquetMR when adding enhancements.

Thanks.