Posted to dev@hudi.apache.org by 管梓越 <gu...@bytedance.com> on 2021/10/06 06:53:46 UTC

Re: [Phishing Risk] [External] [Delta Streamer] file name mismatch with meta when compaction running

Hi JianFeng
    The image does not seem to have come through, so I cannot see it on
my side. I am happy to share some information about your first question.
    The name of a base file has the form {fileID}_{writeToken}_{instant}.
For the write token, the method makeWriteToken in
org.apache.hudi.common.fs.FSUtils shows how it is generated from three
pieces of Spark task information. As far as I know, the write token is
designed to distinguish files in the same file group that were generated
by different task attempts.
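
    As a rough illustration of that naming scheme, here is a minimal
Python sketch (not Hudi code) that splits a base file name into its parts.
The helper names parse_base_file_name and split_write_token are
hypothetical. The meaning assigned to the three numbers in the write token
(task partition ID, stage ID, task attempt ID) is an assumption based on a
reading of makeWriteToken and should be checked against FSUtils in the
Hudi version you run.

```python
import re
from typing import NamedTuple, Tuple


class BaseFileName(NamedTuple):
    file_id: str
    write_token: str
    instant: str


# Assumed layout: {fileID}_{writeToken}_{instant}.parquet
_BASE_FILE = re.compile(
    r"^(?P<file_id>[^_]+)_(?P<write_token>\d+-\d+-\d+)_(?P<instant>\d+)\.parquet$"
)


def parse_base_file_name(name: str) -> BaseFileName:
    """Split a base file name (no directory part) into its three components."""
    match = _BASE_FILE.match(name)
    if match is None:
        raise ValueError(f"not a recognised base file name: {name}")
    return BaseFileName(match["file_id"], match["write_token"], match["instant"])


def split_write_token(token: str) -> Tuple[int, int, int]:
    """Return the three numbers of a write token.

    Assumed order: (task partition ID, stage ID, task attempt ID).
    """
    first, second, third = (int(part) for part in token.split("-"))
    return first, second, third


if __name__ == "__main__":
    # File name taken from the original report in this thread.
    name = (
        "813800cd-1aaf-43ea-829f-4feef4a51cb3-0"
        "_19-2672-4427765_20211006051032.parquet"
    )
    parsed = parse_base_file_name(name)
    print(parsed)
    print(split_write_token(parsed.write_token))
```

    Under that assumption, 2672-4427765 in the reported file name would be
the stage ID and the task attempt ID of the Spark task that wrote the file.
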
    Let me share a scenario. In a Spark compaction job, speculation is
allowed. Two task attempts may try to generate the base file for the same
file group, and only the file written by the successful attempt is finally
picked up by Hudi. Hudi uses the file name returned by the successful task
to identify the one it wants. The reconcileAgainstMarkers method in class
HoodieTable shows how this process works.
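
    The idea in that scenario can be sketched as follows. This is a
simplified Python illustration, not what reconcileAgainstMarkers actually
does (the real method works from marker files). The function name
reconcile and the file names in the usage example are hypothetical.

```python
from typing import Iterable, List, Tuple


def reconcile(
    files_on_storage: Iterable[str],
    files_reported_by_succeeded_tasks: Iterable[str],
) -> Tuple[List[str], List[str], List[str]]:
    """Compare what is on storage with what the succeeded tasks reported.

    Returns three sorted lists:
      keep    - on storage and reported by a succeeded task
      delete  - on storage but not reported (e.g. left by a losing attempt)
      missing - reported but not found on storage
    """
    on_storage = set(files_on_storage)
    reported = set(files_reported_by_succeeded_tasks)
    keep = sorted(on_storage & reported)
    delete = sorted(on_storage - reported)
    missing = sorted(reported - on_storage)
    return keep, delete, missing


if __name__ == "__main__":
    # Two attempts wrote a base file for the same file group and instant;
    # only the write token differs (hypothetical names).
    storage = [
        "fg1-0_3-10-100_20211006051032.parquet",
        "fg1-0_3-10-101_20211006051032.parquet",
    ]
    reported = ["fg1-0_3-10-101_20211006051032.parquet"]
    keep, delete, missing = reconcile(storage, reported)
    print("keep:", keep)
    print("delete:", delete)
    print("missing:", missing)
```

    In this simplified picture, a non-empty "missing" list is the kind of
mismatch reported in this thread: the commit metadata names a file whose
write token differs from that of the file actually present on storage.
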
    I have no idea how this problem occurred; it should not happen with
the default config and HDFS. I hope this information helps.
    By the way, there is a WeChat account that has shared some excellent
articles in Chinese about Hudi. For those who read Chinese, the following
article may provide more information. Many thanks to the author.

https://mp.weixin.qq.com/s?__biz=MzIyMzQ0NjA0MQ==&mid=2247484306&idx=1&sn=1d853469159a600d82050c17e6a2a075&chksm=e81f56e4df68dff2da417109c4a971aef54f056bc0519558c58e23fe60b90dc6e4f8d7e92774&token=1688466117&lang=zh_CN#rd

On Wed, Oct 6, 2021 at 1:35 PM Jian Feng <ji...@shopee.com> wrote:

> When I run Delta Streamer (version 0.9) to ingest data from Kafka into an
> HBase-indexed MOR table, after a few commits I hit this error while
> compaction is running:
> [image: image.png]
>
> In HDFS there is a file with the same fileId and commit instant, but the
> middle part of its name is different:
> hdfs://tl5/projects/data_vite/mysql_ingestion/rti_vite/shopee_item_v4_db__item_v4_tab_newHbase/BR/2021-10/813800cd-1aaf-43ea-829f-4feef4a51cb3-0_19-2672-4427765_*20211006051032*.parquet
>
> Below is the content of 20211006051032.commit:
>
>
> [image: image.png]
>
>
> What do 2672-4427765 and 2657-4368242 mean, and how can I fix this error?
>
> I tried recreating the table, and it happens again.
>
>
> --
> *Jian Feng,冯健*
> Shopee | Engineer | Data Infrastructure
>

Re: [Phishing Risk] [External] [Delta Streamer] file name mismatch with meta when compaction running

Posted by Jian Feng <ji...@shopee.com>.
Can you see the pictures here? https://github.com/apache/hudi/issues/3755
Thanks! Let me read that article. I'm trying to create another MOR table
with a Bloom index to see if the problem still exists.



-- 
*Jian Feng,冯健*
Shopee | Engineer | Data Infrastructure