Posted to commits@hudi.apache.org by "Nikita Sheremet (Jira)" <ji...@apache.org> on 2021/08/11 18:04:00 UTC

[jira] [Commented] (HUDI-1363) Provide Option to drop columns after they are used to generate partition or record keys

    [ https://issues.apache.org/jira/browse/HUDI-1363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17397525#comment-17397525 ] 

Nikita Sheremet commented on HUDI-1363:
---------------------------------------

h3. [~vinoth] 

Conversation history from emails:
{quote} 

Hi Vinoth,

Thank you for your assistance.


Let me state our use case. I'll post it on Jira.
We have ~100TB Data Lake with personal data.
We use Google BigQuery Omni to query the data because not every tool can query such a volume of data quickly.
Also, we need to remove personal data from time to time, so we've decided to use Hudi.
The problem is that after we converted the data from regular Parquet to Hudi Parquet, BigQuery could no longer build its index or query the data efficiently.
We got an error:
BigQuery error in mk operation: Error while reading table: data_detections_hudi, error message: Failed to add partition key y (type: TYPE_INT64) to schema, because another column with the same name was already present. This is not allowed. Full partition schema: [y:TYPE_INT64, m:TYPE_INT64, d:TYPE_INT64]. 
So we need to get rid of the y, m, d columns somehow. It seems like the PR might solve our problem.
Unfortunately, we can't use Hudi due to this issue.
We would really appreciate it if you could help us solve it.{quote}
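The error above comes from the hive-style partition layout: the partition path already supplies y, m, d, while the Hudi-written Parquet files still carry columns with those names. A minimal sketch of the clash (pure Python; `merge_partition_schema` is a hypothetical helper for illustration, not a BigQuery API):

```python
# Illustrative sketch of why BigQuery rejects the table: the hive-style
# partition path contributes y, m, d, and the Parquet files written by
# Hudi also contain columns named y, m, d -- a duplicate-column clash.
# merge_partition_schema is a hypothetical helper, not a BigQuery API.

def merge_partition_schema(file_columns, partition_keys):
    """Combine file columns with hive partition keys, as a loader would."""
    schema = list(file_columns)
    for key in partition_keys:
        if key in schema:
            raise ValueError(
                f"Failed to add partition key {key}: another column "
                "with the same name was already present."
            )
        schema.append(key)
    return schema

# Plain Parquet written by Spark: partitionBy() drops y, m, d from the files,
# so merging the partition keys back in succeeds.
plain_parquet_columns = ["user_id", "detection"]
print(merge_partition_schema(plain_parquet_columns, ["y", "m", "d"]))

# Hudi keeps the partition columns inside the data files, so the merge fails.
hudi_parquet_columns = ["user_id", "detection", "y", "m", "d"]
try:
    merge_partition_schema(hudi_parquet_columns, ["y", "m", "d"])
except ValueError as e:
    print("BigQuery-style error:", e)
```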
 
 
{quote}Thanks, that helps a lot! are you using [https://cloud.google.com/bigquery/docs/hive-partitioned-queries-gcs] ? 
y, m, d are already present in the input data frame and you are doing a `df.write.format("hudi").partitionBy("y", "m", "d")` ? 
I think the Spark parquet writer explicitly removes this. We keep it so that you can, say, repartition later and build different range indexes off the file. 
But I understand BQ expects this. I am also thinking about workarounds where we could name the partition columns differently from y, m, d. 
Let's continue on the JIRA if you don't mind.{quote}
 
"I think the spark parquet write explicitly removes this. "
If we speak about just columns - yes. But switching to pure parquet writer removes all hudi features like indexing and fast deleting records by id.
To say the truth it is very hard to do something with workarounds and keep hudi features. For example it is impossible to read (via hudi) from full path (like mytable/y=2020/m=01/d=01) remove columns from parquet and write to the same path directly - hudi start to figure out the table in hive catalog and failed without touching any data. 
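For reference, the option this issue tracks was released as a write config in Hudi 0.9.0 (the Fix Version below). Assuming the released config names (verify against your Hudi version's configuration reference), a write would set something like:

```
# Hedged sketch of the relevant Hudi write options (names as of 0.9.0;
# check the Hudi configuration reference for your version):
hoodie.datasource.write.partitionpath.field=y,m,d
hoodie.datasource.write.drop.partition.columns=true
```

With drop.partition.columns enabled, the partition columns are encoded only in the partition path, not duplicated inside the Parquet files, which is the layout BigQuery's hive-partitioning expects.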

> Provide Option to drop columns after they are used to generate partition or record keys
> ---------------------------------------------------------------------------------------
>
>                 Key: HUDI-1363
>                 URL: https://issues.apache.org/jira/browse/HUDI-1363
>             Project: Apache Hudi
>          Issue Type: New Feature
>          Components: Writer Core
>            Reporter: Balaji Varadarajan
>            Assignee: liwei
>            Priority: Major
>             Fix For: 0.9.0
>
>
> Context: https://github.com/apache/hudi/issues/2213



--
This message was sent by Atlassian Jira
(v8.3.4#803005)