Posted to jira@arrow.apache.org by "Radu Teodorescu (Jira)" <ji...@apache.org> on 2021/09/20 22:38:00 UTC
[jira] [Updated] (ARROW-14047) [C++] [Parquet] FileReader returns
inconsistent results on repeat reads
[ https://issues.apache.org/jira/browse/ARROW-14047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Radu Teodorescu updated ARROW-14047:
------------------------------------
Summary: [C++] [Parquet] FileReader returns inconsistent results on repeat reads (was: FileReader returns inconsistent results on repeat reads)
> [C++] [Parquet] FileReader returns inconsistent results on repeat reads
> -----------------------------------------------------------------------
>
> Key: ARROW-14047
> URL: https://issues.apache.org/jira/browse/ARROW-14047
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++
> Affects Versions: 5.0.0
> Environment: CentOS 7, gcc 9.2.0
> Reporter: Radu Teodorescu
> Priority: Major
> Attachments: Capture.PNG, writeReadRowGroup.parquet
>
>
> For certain data sets that contain lists of structs, repeated reads of the same file through the same FileReader yield different results. The attached file (writeReadRowGroup.parquet) exhibits this behavior, and the code below reproduces it:
> {code:cpp}
> // Open the attached file and create a Parquet-to-Arrow reader.
> filesystem::path filePath = dirPath / "writeReadRowGroup.parquet";
> arrow::MemoryPool *pool = arrow::default_memory_pool();
> std::shared_ptr<arrow::io::ReadableFile> infile;
> PARQUET_ASSIGN_OR_THROW(infile, arrow::io::ReadableFile::Open(filePath, pool));
> std::unique_ptr<parquet::arrow::FileReader> arrow_reader;
> auto status = parquet::arrow::OpenFile(infile, pool, &arrow_reader);
> CHECK_OK(status);
> std::shared_ptr<arrow::Schema> readSchema;
> CHECK_OK(arrow_reader->GetSchema(&readSchema));
>
> // First read: the baseline that every later read is compared against.
> std::shared_ptr<arrow::Table> table;
> std::vector<int> indicesToGet;
> CHECK_OK(arrow_reader->ReadTable(&table));
> auto recordListCol1 = arrow::Table::Make(
>     arrow::schema({table->schema()->GetFieldByName("recordList")}),
>     {table->GetColumnByName("recordList")});
>
> // Re-read the same file through the same reader and compare with the baseline.
> for (int i = 0; i < 20; ++i) {
>   cout << "data reread operation number = " + std::to_string(i) << endl;
>   std::shared_ptr<arrow::Table> table2;
>   CHECK_OK(arrow_reader->ReadTable(&table2));
>   auto recordListCol2 = arrow::Table::Make(
>       arrow::schema({table2->schema()->GetFieldByName("recordList")}),
>       {table2->GetColumnByName("recordList")});
>   bool equals = recordListCol1->Equals(*recordListCol2);
>   if (!equals) {
>     cout << recordListCol1->ToString() << endl;
>     cout << endl << "new table" << endl;
>     cout << recordListCol2->ToString() << endl;
>     throw std::runtime_error("Subsequent re-read failure ");
>   }
> }
> {code}
> As the attached capture (Capture.PNG) shows, the state machine used to track nulls appears to be broken when the same reader is used for a subsequent read.
>
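The suspected failure mode (null-tracking state that survives from one read to the next) can be illustrated with a toy reader. This is only a sketch of that class of bug, not Arrow's implementation: `ToyNullableReader`, `ReadAllBuggy`, and `ReadAllFixed` are hypothetical names, and the single leftover cursor is an assumption made purely for illustration.

```cpp
#include <cstddef>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical toy reader, NOT Arrow's code. It stores only the non-null
// values plus a validity bitmap, and keeps its bitmap cursor as a member.
class ToyNullableReader {
 public:
  ToyNullableReader(std::vector<int> values, std::vector<bool> valid)
      : values_(std::move(values)), valid_(std::move(valid)) {}

  // Buggy read: the bitmap cursor is never reset, so a second call starts
  // past the end of the bitmap and reports every slot as null.
  std::vector<std::optional<int>> ReadAllBuggy() { return Read(false); }

  // Fixed read: all per-read state is reset before decoding.
  std::vector<std::optional<int>> ReadAllFixed() { return Read(true); }

 private:
  std::vector<std::optional<int>> Read(bool reset_state) {
    if (reset_state) validity_pos_ = 0;
    std::vector<std::optional<int>> out;
    std::size_t value_pos = 0;  // index into the dense non-null values
    for (std::size_t i = 0; i < valid_.size(); ++i) {
      // A cursor beyond the bitmap is treated as "null" in this toy model.
      bool is_valid = validity_pos_ < valid_.size() && valid_[validity_pos_];
      ++validity_pos_;
      if (is_valid) {
        out.push_back(values_[value_pos++]);
      } else {
        out.push_back(std::nullopt);
      }
    }
    return out;
  }

  std::vector<int> values_;
  std::vector<bool> valid_;
  std::size_t validity_pos_ = 0;  // state that leaks between reads
};
```

In this toy model, constructing `ToyNullableReader r({10, 30}, {true, false, true})` and calling `r.ReadAllBuggy()` twice gives two different results, while repeated calls to `ReadAllFixed()` agree with each other. That is the same symptom the reproducer checks for with `Equals`.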
--
This message was sent by Atlassian Jira
(v8.3.4#803005)