Posted to issues@nifi.apache.org by "Rajmund Takacs (Jira)" <ji...@apache.org> on 2024/02/27 10:40:00 UTC

[jira] [Updated] (NIFI-12843) If record count is set, ParquetRecordReader does not read the whole file

     [ https://issues.apache.org/jira/browse/NIFI-12843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rajmund Takacs updated NIFI-12843:
----------------------------------
    Attachment: parquet_reader_usecases.json

> If record count is set, ParquetRecordReader does not read the whole file
> ------------------------------------------------------------------------
>
>                 Key: NIFI-12843
>                 URL: https://issues.apache.org/jira/browse/NIFI-12843
>             Project: Apache NiFi
>          Issue Type: Bug
>          Components: Extensions
>    Affects Versions: 1.25.0, 2.0.0-M2
>            Reporter: Rajmund Takacs
>            Assignee: Rajmund Takacs
>            Priority: Major
>         Attachments: parquet_reader_usecases.json
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Earlier, ParquetRecordReader ignored the record.count attribute of the incoming FlowFile. With NIFI-12241 this was changed: the reader now reads only the specified number of rows from the record set. If the Parquet file was not produced by a record writer, this attribute is normally absent and the reader reads the whole file. However, processors that produce a Parquet file by processing record sets may set this attribute to refer to the upstream record set rather than to the actual Parquet content, which leads to incorrect behavior.
> For example: ConsumeKafka produces a FlowFile containing a single Kafka record whose content is a Parquet file with 1000 rows. record.count is then set to 1 instead of 1000, because it refers to the Kafka record set, so ParquetRecordReader now reads only the first row of the Parquet file.
> The sole reason for changing the reader to take record.count into account is that the CalculateParquetOffsets processors generate FlowFiles with the same content but different offset and count attributes, each representing a slice of the original, large input. The Parquet reader then acts as if the large FlowFile contained only that slice, which makes processing more efficient. There is no need to support files that have no offset but do have a limit (count), so changing the reader to honor record.count only when the offset attribute is also present would be a reasonable fix, as sketched below.
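> A minimal sketch of the proposed condition (illustrative only, not the actual ParquetRecordReader code; the attribute names record.count and record.offset, and the helper name, are assumptions for this example):
> {code:java}
> import java.util.Map;
>
> class ParquetReadLimitSketch {
>     /**
>      * Returns the number of records the reader should return, or null to
>      * read the whole Parquet file. The count is honored only when an
>      * offset attribute is also present, i.e. when the FlowFile was sliced
>      * by CalculateParquetOffsets (attribute names are assumed here).
>      */
>     static Long resolveRecordLimit(final Map<String, String> attributes) {
>         final String offset = attributes.get("record.offset");
>         final String count = attributes.get("record.count");
>         if (offset == null || count == null) {
>             // No offset: any record.count refers to an upstream record set
>             // (e.g. Kafka records), not to this file, so read everything.
>             return null;
>         }
>         try {
>             return Long.parseLong(count);
>         } catch (final NumberFormatException e) {
>             return null; // malformed attribute: fall back to the whole file
>         }
>     }
> }
> {code}
> With such a guard, the ConsumeKafka case above (record.count set to 1, no offset) would fall back to reading all 1000 rows, while the CalculateParquetOffsets slices would keep the efficient sliced behavior.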



--
This message was sent by Atlassian Jira
(v8.20.10#820010)