Posted to commits@hudi.apache.org by "Prashant Wason (Jira)" <ji...@apache.org> on 2020/05/29 22:10:00 UTC

[jira] [Closed] (HUDI-797) Improve performance of rewriting AVRO records in HoodieAvroUtils::rewriteRecord

     [ https://issues.apache.org/jira/browse/HUDI-797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prashant Wason closed HUDI-797.
-------------------------------
    Resolution: Abandoned

> Improve performance of rewriting AVRO records in HoodieAvroUtils::rewriteRecord
> -------------------------------------------------------------------------------
>
>                 Key: HUDI-797
>                 URL: https://issues.apache.org/jira/browse/HUDI-797
>             Project: Apache Hudi
>          Issue Type: Improvement
>            Reporter: Prashant Wason
>            Assignee: Prashant Wason
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Data is ingested into a [HUDI|https://hudi.apache.org/] dataset as AVRO-encoded records. These records have a [schema|https://avro.apache.org/docs/current/spec.html] which is determined by the dataset user and provided to HUDI during the writing process (as part of HoodieWriteConfig). The records are finally saved in [parquet|https://parquet.apache.org/] files, which include the schema (in parquet format) in the footer of each individual file.
>  
> HUDI's design requires adding some metadata fields to every incoming record to aid in book-keeping and indexing. To achieve this, the incoming schema is extended with the HUDI metadata fields; the result is called the HUDI schema for the dataset. Each incoming record is then re-written to translate it from the incoming schema into the HUDI schema. Re-writing a record to a new schema is reasonably fast, as it simply looks up all fields in the incoming record and adds them to a new record, but this conversion takes place for each and every incoming record.
> When ingesting large datasets (billions of records) or a large number of datasets, even small improvements in this CPU-bound conversion can translate into a notable improvement in compute efficiency.
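The per-record rewrite described above can be modeled, in very simplified form, as a field-by-field copy into a record laid out by the extended schema. The Python sketch below is illustrative only; Hudi's actual implementation is Java operating on Avro GenericRecord objects (HoodieAvroUtils.rewriteRecord), and the dict-based record representation here is an assumption for brevity. The metadata field names are the ones Hudi adds.

```python
# Simplified model of the schema extension and per-record rewrite described
# above. NOTE: illustrative sketch only; the real code is Java and works on
# Avro GenericRecord objects, not dicts.

# The metadata fields Hudi prepends to the user's schema.
HOODIE_META_FIELDS = [
    "_hoodie_commit_time",
    "_hoodie_commit_seqno",
    "_hoodie_record_key",
    "_hoodie_partition_path",
    "_hoodie_file_name",
]

def make_hoodie_schema(incoming_fields):
    """Extend the incoming schema (modeled as a list of field names)
    with the HUDI metadata fields to form the HUDI schema."""
    return HOODIE_META_FIELDS + list(incoming_fields)

def rewrite_record(record, hoodie_schema):
    """Copy every field of the incoming record into a new record laid out
    per the HUDI schema; metadata fields start out unset (None).
    This copy runs once per ingested record, which is why even small
    per-field savings matter at billions of records."""
    return {field: record.get(field) for field in hoodie_schema}

# Usage example
schema = make_hoodie_schema(["id", "name"])
rewritten = rewrite_record({"id": 1, "name": "a"}, schema)
print(rewritten["id"])                   # original field carried over
print(rewritten["_hoodie_commit_time"])  # None until Hudi populates it
```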



--
This message was sent by Atlassian Jira
(v8.3.4#803005)