Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/04/14 17:11:29 UTC

[GitHub] [incubator-hudi] ahmed-elfar edited a comment on issue #1498: Migrating parquet table to hudi issue [SUPPORT]

ahmed-elfar edited a comment on issue #1498: Migrating parquet table to hudi issue [SUPPORT]
URL: https://github.com/apache/incubator-hudi/issues/1498#issuecomment-613563197
 
 
   @vinothchandar I apologize for the delayed response, and thanks again for your help and detailed answers.
   
   > Is it possible to share the data generation tool with us or point us to reproducing this ourselves locally? We can go much faster if we are able to repro this ourselves..
   
   Sure, this is the public repo for generating the data: https://github.com/gregrahn/tpch-kit
   It provides the information you need for data generation, sizing, etc.
   
   
   > Schema for lineitem
   
   Adding more details and updating the schema screenshot mentioned in the previous comment:
   
   **RECORDKEY_FIELD_OPT_KEY**: composite (l_linenumber, l_orderkey)
   **PARTITIONPATH_FIELD_OPT_KEY**: optional, default (non-partitioned), or l_shipmode
   **PRECOMBINE_FIELD_OPT_KEY**: l_commitdate, or a newly generated timestamp column **last_updated**.
   
   ![Screenshot from 2020-04-14 17-54-43](https://user-images.githubusercontent.com/20902425/79246806-f680cc00-7e79-11ea-8edc-bd711d8492ff.png)
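   For reference, a minimal sketch of how these options could be wired into a Spark datasource write. The table path, table name, and the choice of ComplexKeyGenerator are assumptions for illustration, and the exact constant names/class paths may differ across Hudi versions:
   
   ```scala
   import org.apache.hudi.DataSourceWriteOptions._
   import org.apache.hudi.config.HoodieWriteConfig
   
   // Hypothetical writer call using the keys above; df is the lineitem DataFrame.
   df.write.format("hudi")
     .option(HoodieWriteConfig.TABLE_NAME, "lineitem")
     .option(RECORDKEY_FIELD_OPT_KEY, "l_linenumber,l_orderkey")   // composite record key
     .option(KEYGENERATOR_CLASS_OPT_KEY, "org.apache.hudi.keygen.ComplexKeyGenerator")
     .option(PARTITIONPATH_FIELD_OPT_KEY, "l_shipmode")            // or omit for non-partitioned
     .option(PRECOMBINE_FIELD_OPT_KEY, "l_commitdate")             // or last_updated
     .mode("append")
     .save("/data/hudi/lineitem")
   ```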
   
   
   This is the **official documentation** for the dataset definitions, schema, queries, and the business logic behind the queries: http://www.tpc.org/tpc_documents_current_versions/pdf/tpc-h_v2.18.0.pdf
   
   
   >  @bvaradar & @umehrot2 will have the ability to seamlessly bootstrap the data into hudi without rewriting in the next release.
   
   Are we talking about the proposal described in RFC-12: https://cwiki.apache.org/confluence/display/HUDI/RFC+-+12+%3A+Efficient+Migration+of+Large+Parquet+Tables+to+Apache+Hudi
   
   If so, we need more clarification regarding this approach.
   
   > you ll also have the ability to do a one-time bulk_insert for last N partitions to get the upsert performance benefits as we discussed above
   
   One of the attempts mentioned in the first comment might be similar. I will explain it in detail so you can confirm whether it should work for now, i.e. **whether it produces a valid Hudi table**:
   
   Consider a 1 TB Parquet table as input, either **partitioned** or **non-partitioned**, with Spark resources of 256 GB RAM and 32 cores:
   
   **Case non-partitioned** 
   
   - We take the suggested/recommended partition column(s), project them, and apply a distinct, which yields the filter values to pass to the next stage of the pipeline.
   - The next step submits sequential Spark applications, each filtering the input data on one of the passed filter values, producing a DataFrame holding a single partition.
   - Write (**bulk-insert**) the filtered DataFrame to the Hudi table with the chosen partition column, using save mode **append** (see the sketch after this list).
   - The Hudi table is thus written partition by partition.
   - Finally, query the Hudi table to check that it is a valid table; it looks valid.
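   
   A minimal sketch of that loop, assuming the lineitem table, l_shipmode as the partition column, and the same writer options as in the earlier snippet (paths are illustrative, and `spark` is an in-scope SparkSession):
   
   ```scala
   import org.apache.hudi.DataSourceWriteOptions._
   import org.apache.hudi.config.HoodieWriteConfig
   import spark.implicits._
   
   val input = spark.read.parquet("/data/tpch/lineitem")
   
   // Step 1: distinct values of the partition column become the filter values.
   val shipModes = input.select("l_shipmode").distinct().as[String].collect()
   
   // Steps 2-4: one bulk_insert per value, appended to the same Hudi table,
   // so the table is built up partition by partition.
   shipModes.foreach { mode =>
     input.filter($"l_shipmode" === mode)
       .write.format("hudi")
       .option(HoodieWriteConfig.TABLE_NAME, "lineitem")
       .option(OPERATION_OPT_KEY, BULK_INSERT_OPERATION_OPT_VAL)
       .option(RECORDKEY_FIELD_OPT_KEY, "l_linenumber,l_orderkey")
       .option(KEYGENERATOR_CLASS_OPT_KEY, "org.apache.hudi.keygen.ComplexKeyGenerator")
       .option(PARTITIONPATH_FIELD_OPT_KEY, "l_shipmode")
       .option(PRECOMBINE_FIELD_OPT_KEY, "l_commitdate")
       .mode("append")
       .save("/data/hudi/lineitem")
   }
   ```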
   
   **Case partitioned**: same as above, with faster filter operations since the filters prune input partitions. 
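   
   In either case, a quick way to sanity-check the resulting table is to read it back through the Hudi datasource and inspect the Hudi metadata columns (the path and glob depth are assumptions; older Hudi versions need a glob pattern down to the partition level):
   
   ```scala
   // Read the table back via the Hudi datasource and inspect metadata columns.
   val hudiDf = spark.read.format("hudi").load("/data/hudi/lineitem/*")
   hudiDf.select("_hoodie_commit_time", "_hoodie_partition_path", "_hoodie_record_key").show(5)
   println(hudiDf.count())
   ```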
   
   **Pros**: 
   - Avoids a lot of disk spilling and GC hits. 
   - Uses fewer resources for the initial load.
   
   **Cons**: 
   - No time improvement if you have enough resources to load the table at once. 
   - We end up with a partitioned table, which might not be needed in some of our use cases.
   
   **Questions**:
   - Is this approach valid, or will it impact upsert operations in the future?
   
   
   > I would be happy to jump on a call with you folks and get this moving along.. I am also very excited to work with a user like yourself and move the perf aspects of the project along more..
   
   We are excited as well to have a call together; please let me know how we can proceed with scheduling the meeting.
   
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services