Posted to dev@parquet.apache.org by "Gabor Szadovszky (Jira)" <ji...@apache.org> on 2021/02/17 10:18:00 UTC
[jira] [Updated] (PARQUET-1985) Improve integration tests between implementations

     [ https://issues.apache.org/jira/browse/PARQUET-1985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gabor Szadovszky updated PARQUET-1985:
--------------------------------------
    Description: 
We lack proper integration tests between components. Fortunately, we already have a git repository for uploading test data: https://github.com/apache/parquet-testing.

The idea is the following.
Create a directory structure for the different versions of the implementations, containing parquet files with defined data. The structure shall be self-descriptive so that we can write integration tests that read the whole structure automatically and also work with files added later.

The following directory structure is an example that satisfies these requirements:
{noformat}
test-data/
├── impala
│   ├── 3.2.0
│   │   └── basic-data.parquet
│   ├── 3.3.0
│   │   └── basic-data.parquet
│   └── 3.4.0
│       ├── basic-data.lz4.parquet
│       ├── basic-data.snappy.parquet
│       ├── some-specific-issue-2.parquet
│       ├── some-specific-issue-3.csv
│       ├── some-specific-issue-3_mode1.parquet
│       ├── some-specific-issue-3_mode2.parquet
│       └── some-specific-issue-3.schema
├── parquet-cpp
│   ├── 1.5.0
│   │   ├── basic-data.lz4.parquet
│   │   └── basic-data.parquet
│   └── 1.6.0
│       ├── basic-data.lz4.parquet
│       └── some-specific-issue-2.parquet
├── parquet-mr
│   ├── 1.10.2
│   │   └── basic-data.parquet
│   ├── 1.11.1
│   │   ├── basic-data.parquet
│   │   └── some-specific-issue-1.parquet
│   ├── 1.12.0
│   │   ├── basic-data.br.parquet
│   │   ├── basic-data.lz4.parquet
│   │   ├── basic-data.snappy.parquet
│   │   ├── basic-data.zstd.parquet
│   │   ├── some-specific-issue-1.parquet
│   │   └── some-specific-issue-2.parquet
│   ├── some-specific-issue-1.csv
│   └── some-specific-issue-1.schema
├── basic-data.csv
├── basic-data.schema
├── some-specific-issue-2.csv
└── some-specific-issue-2.schema
{noformat}
Parquet files are created at leaf level. The expected data is saved in CSV format (to be specified: separators, how to save binary values, etc.), and the expected schema (which specifies the data types independently of the parquet files) is saved in .schema files. The csv and schema files can be saved at the same level as the parquet files, or at upper levels if they are common to several parquet files.
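
To make the lookup rule concrete, here is a minimal Python sketch of how a test harness might walk such a tree. It is not part of the proposal and rests on assumptions that the description leaves open: parquet files sit exactly at <implementation>/<version>/<file>, the case name is the file name up to the first "." with any "_<variant>" suffix dropped, and the expected csv/schema is the nearest file of that name found while walking up towards the root. The names DataCase, case_name, find_upwards and discover are hypothetical.

```python
from pathlib import Path
from typing import Iterator, NamedTuple, Optional


class DataCase(NamedTuple):
    """One parquet file plus the expectation files resolved for it."""
    parquet: Path
    implementation: str
    version: str
    name: str
    expected_csv: Optional[Path]
    expected_schema: Optional[Path]


def case_name(parquet_file: Path) -> str:
    # Assumed naming convention:
    #   basic-data.lz4.parquet              -> basic-data
    #   some-specific-issue-3_mode1.parquet -> some-specific-issue-3
    stem = parquet_file.name.split(".", 1)[0]
    return stem.split("_", 1)[0]


def find_upwards(start_dir: Path, root: Path, filename: str) -> Optional[Path]:
    # Look in start_dir first, then in each parent directory up to root.
    directory = start_dir
    while True:
        candidate = directory / filename
        if candidate.is_file():
            return candidate
        # Stop at the test-data root (or at the filesystem root as a guard).
        if directory == root or directory.parent == directory:
            return None
        directory = directory.parent


def discover(root: Path) -> Iterator[DataCase]:
    root = root.resolve()
    for parquet in sorted(root.rglob("*.parquet")):
        parts = parquet.relative_to(root).parts
        if len(parts) != 3:
            continue  # assumed layout: <implementation>/<version>/<file>
        implementation, version, _ = parts
        name = case_name(parquet)
        yield DataCase(
            parquet=parquet,
            implementation=implementation,
            version=version,
            name=name,
            expected_csv=find_upwards(parquet.parent, root, name + ".csv"),
            expected_schema=find_upwards(parquet.parent, root, name + ".schema"),
        )
```

A test runner could iterate over discover(Path("test-data")), read each parquet file with the implementation under test, and compare the result against expected_csv and expected_schema. A case whose expectation files resolve to None would most likely be reported as a broken fixture rather than silently skipped.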

Any comments about the idea are welcome.


  was:
We lack proper integration tests between components. Fortunately, we already have a git repository for uploading test data: https://github.com/apache/parquet-testing.

The idea is the following.
Create a directory structure for the different versions of the implementations, containing parquet files with defined data. The structure shall be self-descriptive so that we can write integration tests that read the whole structure automatically and also work with files added later.

The following directory structure is an example that satisfies these requirements:
{noformat}
test-data/
├── basic-data.csv
├── basic-data.schema
├── impala
│   ├── 3.2.0
│   │   └── basic-data.parquet
│   ├── 3.3.0
│   │   └── basic-data.parquet
│   └── 3.4.0
│       ├── basic-data.lz4.parquet
│       ├── basic-data.snappy.parquet
│       ├── some-specific-issue-2.parquet
│       ├── some-specific-issue-3.csv
│       ├── some-specific-issue-3_mode1.parquet
│       ├── some-specific-issue-3_mode2.parquet
│       └── some-specific-issue-3.schema
├── parquet-cpp
│   ├── 1.5.0
│   │   ├── basic-data.lz4.parquet
│   │   └── basic-data.parquet
│   └── 1.6.0
│       ├── basic-data.lz4.parquet
│       └── some-specific-issue-2.parquet
├── parquet-mr
│   ├── 1.10.2
│   │   └── basic-data.parquet
│   ├── 1.11.1
│   │   ├── basic-data.parquet
│   │   └── some-specific-issue-1.parquet
│   ├── 1.12.0
│   │   ├── basic-data.br.parquet
│   │   ├── basic-data.lz4.parquet
│   │   ├── basic-data.snappy.parquet
│   │   ├── basic-data.zstd.parquet
│   │   ├── some-specific-issue-1.parquet
│   │   └── some-specific-issue-2.parquet
│   ├── some-specific-issue-1.csv
│   └── some-specific-issue-1.schema
├── some-specific-issue-2.csv
└── some-specific-issue-2.schema
{noformat}
Parquet files are created at leaf level. The expected data is saved in CSV format (to be specified: separators, how to save binary values, etc.), and the expected schema (which specifies the data types independently of the parquet files) is saved in .schema files. The csv and schema files can be saved at the same level as the parquet files, or at upper levels if they are common to several parquet files.

Any comments about the idea are welcome.



> Improve integration tests between implementations
> -------------------------------------------------
>
>                 Key: PARQUET-1985
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1985
>             Project: Parquet
>          Issue Type: Test
>          Components: parquet-testing
>            Reporter: Gabor Szadovszky
>            Priority: Major
>
> We lack proper integration tests between components. Fortunately, we already have a git repository for uploading test data: https://github.com/apache/parquet-testing.
> The idea is the following.
> Create a directory structure for the different versions of the implementations, containing parquet files with defined data. The structure shall be self-descriptive so that we can write integration tests that read the whole structure automatically and also work with files added later.
> The following directory structure is an example that satisfies these requirements:
> {noformat}
> test-data/
> ├── impala
> │   ├── 3.2.0
> │   │   └── basic-data.parquet
> │   ├── 3.3.0
> │   │   └── basic-data.parquet
> │   └── 3.4.0
> │       ├── basic-data.lz4.parquet
> │       ├── basic-data.snappy.parquet
> │       ├── some-specific-issue-2.parquet
> │       ├── some-specific-issue-3.csv
> │       ├── some-specific-issue-3_mode1.parquet
> │       ├── some-specific-issue-3_mode2.parquet
> │       └── some-specific-issue-3.schema
> ├── parquet-cpp
> │   ├── 1.5.0
> │   │   ├── basic-data.lz4.parquet
> │   │   └── basic-data.parquet
> │   └── 1.6.0
> │       ├── basic-data.lz4.parquet
> │       └── some-specific-issue-2.parquet
> ├── parquet-mr
> │   ├── 1.10.2
> │   │   └── basic-data.parquet
> │   ├── 1.11.1
> │   │   ├── basic-data.parquet
> │   │   └── some-specific-issue-1.parquet
> │   ├── 1.12.0
> │   │   ├── basic-data.br.parquet
> │   │   ├── basic-data.lz4.parquet
> │   │   ├── basic-data.snappy.parquet
> │   │   ├── basic-data.zstd.parquet
> │   │   ├── some-specific-issue-1.parquet
> │   │   └── some-specific-issue-2.parquet
> │   ├── some-specific-issue-1.csv
> │   └── some-specific-issue-1.schema
> ├── basic-data.csv
> ├── basic-data.schema
> ├── some-specific-issue-2.csv
> └── some-specific-issue-2.schema
> {noformat}
> Parquet files are created at leaf level. The expected data is saved in CSV format (to be specified: separators, how to save binary values, etc.), and the expected schema (which specifies the data types independently of the parquet files) is saved in .schema files. The csv and schema files can be saved at the same level as the parquet files, or at upper levels if they are common to several parquet files.
> Any comments about the idea are welcome.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)