Posted to dev@hudi.apache.org by Qian Wang <qw...@gmail.com> on 2019/10/06 16:15:03 UTC

Questions about using Hudi

Hi,

I have some questions when I try to use Hudi in my company’s prod env:

1. To migrate a historical table in HDFS, I tried using hudi-cli and the HDFSParquetImporter tool. How can I specify Spark parameters for this tool, such as the Yarn queue?
2. Hudi needs to write metadata to Hive, and it uses HiveMetastoreClient and Hive JDBC. What should I do if Hive has Kerberos authentication enabled?

Thanks.

Best,
Qian
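
A note on both points: one common way to steer the Spark job behind the importer is through the standard Spark configuration it picks up, rather than through importer flags. The snippet below is only a sketch; it assumes the hudi-cli launcher honours SPARK_CONF_DIR/spark-defaults.conf, and the queue, principal and keytab values are illustrative placeholders:

  # spark-defaults.conf seen by the hudi-cli / importer launcher (assumption)
  spark.yarn.queue       my_team_queue
  spark.yarn.principal   etl_user@EXAMPLE.COM
  spark.yarn.keytab      /etc/security/keytabs/etl_user.keytab

  # for a kerberized Hive metastore, obtain a ticket before launching,
  # and keep a kerberized hive-site.xml on the classpath for Hive sync
  kinit -kt /etc/security/keytabs/etl_user.keytab etl_user@EXAMPLE.COM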

Re: Questions about using Hudi

Posted by Kabeer Ahmed <ka...@linuxmail.org>.
Hi Qian

I think you are using the default COW (Copy On Write) table type. Your first write appears to have written 44G of data, and the second write appears to have written another 44G, doubling the size to 88G.
Could you clear all the data in the folder, start fresh, and then report back the size?
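
A quick way to check this from the command line (plain HDFS CLI, assuming a COPY_ON_WRITE table at the target path):

  # total size of the Hudi dataset, to compare against the ~44G source
  hdfs dfs -du -s -h /path/to/target

  # each completed write leaves a commit file under .hoodie; two bulk writes
  # of the same data would show up here as two commits
  hdfs dfs -ls /path/to/target/.hoodie/*.commit
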
On Oct 12 2019, at 12:04 am, nishith agarwal <n3...@gmail.com> wrote:
> Qian,
>
> These columns will be present for every Hudi dataset. These columns are
> used to provide incremental queries on Hudi datasets so you can get
> changelogs and build incremental ETLs/pipelines.
>
> Thanks,
> Nishith
>
> On Fri, Oct 11, 2019 at 4:00 PM Qian Wang <qw...@gmail.com> wrote:
> > Hi,
> > I found that after I converted to Hudi managed dataset, there are added
> > several columns:
> >
> > _hoodie_commit_time, _hoodie_commit_seqno, _hoodie_record_key,
> > _hoodie_partition_path, _hoodie_file_name
> >
> > Does these columns added into table forever or temporary? Thanks.
> > Best,
> > Qian
> > On Oct 11, 2019, 3:39 PM -0700, Qian Wang <qw...@gmail.com>, wrote:
> > > Hi,
> > >
> > > I have successfully converted the parquet data into Hudi managed dataset. However, I found that the previous data size is about 44G, after converted by Hudi, the data size is about 88G. Why the data size increased almost twice?
> > >
> > > Best,
> > > Qian
> > > On Oct 11, 2019, 1:57 PM -0700, Qian Wang <qw...@gmail.com>, wrote:
> > > > Hi Kabeer,
> > > >
> > > > Thanks for your detailed explanation. I will try it again. Will update you the result.
> > > >
> > > > Best,
> > > > Qian


Re: Questions about using Hudi

Posted by nishith agarwal <n3...@gmail.com>.
Qian,

These columns are present in every Hudi dataset. They are used to support
incremental queries on Hudi datasets, so you can get changelogs and build
incremental ETLs/pipelines.

Thanks,
Nishith
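
As a small illustration of how those columns get used, a changelog-style query can filter on _hoodie_commit_time directly. This is only a sketch: the table name below is hypothetical, it assumes the dataset has been synced to a SQL-queryable table, and Hudi's incremental view does the same thing more efficiently:

  spark-sql -e "SELECT _hoodie_commit_time, _row_key, session_date
                FROM my_hudi_table
                WHERE _hoodie_commit_time > '20191008095056'"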

On Fri, Oct 11, 2019 at 4:00 PM Qian Wang <qw...@gmail.com> wrote:

> Hi,
>
> I found that after I converted to Hudi managed dataset, there are added
> several columns:
>
> _hoodie_commit_time, _hoodie_commit_seqno, _hoodie_record_key,
> _hoodie_partition_path, _hoodie_file_name
>
> Does these columns added into table forever or temporary? Thanks.
>
> Best,
> Qian
> On Oct 11, 2019, 3:39 PM -0700, Qian Wang <qw...@gmail.com>, wrote:
> > Hi,
> >
> > I have successfully converted the parquet data into Hudi managed dataset. However, I found that the previous data size is about 44G, after converted by Hudi, the data size is about 88G. Why the data size increased almost twice?
> >
> > Best,
> > Qian
> > On Oct 11, 2019, 1:57 PM -0700, Qian Wang <qw...@gmail.com>, wrote:
> > > Hi Kabeer,
> > >
> > > Thanks for your detailed explanation. I will try it again. Will update you the result.
> > >
> > > Best,
> > > Qian

Re: Questions about using Hudi

Posted by Qian Wang <qw...@gmail.com>.
Hi,

I found that after I converted to a Hudi managed dataset, several columns have been added:

_hoodie_commit_time, _hoodie_commit_seqno, _hoodie_record_key, _hoodie_partition_path, _hoodie_file_name

Are these columns added to the table permanently, or only temporarily? Thanks.

Best,
Qian
On Oct 11, 2019, 3:39 PM -0700, Qian Wang <qw...@gmail.com>, wrote:
> Hi,
>
> I have successfully converted the parquet data into Hudi managed dataset. However, I found that the previous data size is about 44G, after converted by Hudi, the data size is about 88G. Why the data size increased almost twice?
>
> Best,
> Qian
> On Oct 11, 2019, 1:57 PM -0700, Qian Wang <qw...@gmail.com>, wrote:
> > Hi Kabeer,
> >
> > Thanks for your detailed explanation. I will try it again. Will update you the result.
> >
> > Best,
> > Qian
> > On Oct 11, 2019, 1:49 PM -0700, Kabeer Ahmed <ka...@linuxmail.org>, wrote:
> > > Hi Qian,
> > >
> > > If there are no nulls in the data, then most likey it is issue with the data types being stored. I have seen this issue again and again and in the recent one it was due to me storing double value when I had actually declared the schema as IntegerType. I can reproduce this with an example to prove the point. But I think you should look into your data.
> > > If possible I would recommend you run something like: https://stackoverflow.com/questions/33270907/how-to-validate-contents-of-spark-dataframe (https://link.getmailspring.com/link/1A222369-02FF-464B-9E5E-48022A443BEA@getmailspring.com/0?redirect=https%3A%2F%2Fstackoverflow.com%2Fquestions%2F33270907%2Fhow-to-validate-contents-of-spark-dataframe&recipient=ZGV2QGh1ZGkuYXBhY2hlLm9yZw%3D%3D). This will show you if there is any value in any column that is against the declared schema type. And when you fix that, the errors will go away.
> > > Keep us posted on how you get along with this.
> > > Thanks
> > > Kabeer.
> > >
> > > On Oct 9 2019, at 12:24 am, nishith agarwal <n3...@gmail.com> wrote:
> > > > Hmm, AVRO is case-sensitive but I've not had issues reading fields from
> > > > GenericRecords with lower or upper so I'm not 100% confident on what the
> > > > resolution for a lower vs upper case is. Have you tried using the
> > > > partitionpath field names in upper case (in case your schema field is also
> > > > upper case) ?
> > > >
> > > > -Nishith
> > > > On Tue, Oct 8, 2019 at 4:00 PM Qian Wang <qw...@gmail.com> wrote:
> > > > > Hi Nishith,
> > > > > I have checked the data, there is no null in that field. Does there has
> > > > > other possibility about this error?
> > > > >
> > > > > Thanks,
> > > > > Qian

Re: Questions about using Hudi

Posted by nishith agarwal <n3...@gmail.com>.
Hi Qian,

1. Did you make sure that every time you run the conversion tool, you're
writing to a new path ?
2. What were the parquet file sizes of your original dataset, and what
parquet file size have you configured in Hudi? Compression can also play a
factor here...

Thanks,
Nishith
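
On point 2, the knobs that usually matter are the writer's target parquet file size and the compression codec. The keys below are a sketch of the relevant write configs to compare against your source layout; the exact names and defaults should be double-checked against the Hudi version you are running:

  hoodie.parquet.max.file.size=134217728      # e.g. ~128 MB target data file size
  hoodie.parquet.small.file.limit=104857600   # files below this are treated as small files and filled by later writes
  hoodie.parquet.compression.codec=gzip       # compare with the codec used by the source files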

On Fri, Oct 11, 2019 at 3:40 PM Qian Wang <qw...@gmail.com> wrote:

> Hi,
>
> I have successfully converted the parquet data into Hudi managed dataset.
> However, I found that the previous data size is about 44G, after converted
> by Hudi, the data size is about 88G. Why the data size increased almost
> twice?
>
> Best,
> Qian
> On Oct 11, 2019, 1:57 PM -0700, Qian Wang <qw...@gmail.com>, wrote:
> > Hi Kabeer,
> >
> > Thanks for your detailed explanation. I will try it again. Will update you the result.
> >
> > Best,
> > Qian

Re: Questions about using Hudi

Posted by Qian Wang <qw...@gmail.com>.
Hi,

I have successfully converted the parquet data into a Hudi managed dataset. However, the previous data size was about 44G, and after conversion by Hudi it is about 88G. Why has the data size almost doubled?

Best,
Qian
On Oct 11, 2019, 1:57 PM -0700, Qian Wang <qw...@gmail.com>, wrote:
> Hi Kabeer,
>
> Thanks for your detailed explanation. I will try it again. Will update you the result.
>
> Best,
> Qian
> On Oct 11, 2019, 1:49 PM -0700, Kabeer Ahmed <ka...@linuxmail.org>, wrote:
> > Hi Qian,
> >
> > If there are no nulls in the data, then most likey it is issue with the data types being stored. I have seen this issue again and again and in the recent one it was due to me storing double value when I had actually declared the schema as IntegerType. I can reproduce this with an example to prove the point. But I think you should look into your data.
> > If possible I would recommend you run something like: https://stackoverflow.com/questions/33270907/how-to-validate-contents-of-spark-dataframe (https://link.getmailspring.com/link/1A222369-02FF-464B-9E5E-48022A443BEA@getmailspring.com/0?redirect=https%3A%2F%2Fstackoverflow.com%2Fquestions%2F33270907%2Fhow-to-validate-contents-of-spark-dataframe&recipient=ZGV2QGh1ZGkuYXBhY2hlLm9yZw%3D%3D). This will show you if there is any value in any column that is against the declared schema type. And when you fix that, the errors will go away.
> > Keep us posted on how you get along with this.
> > Thanks
> > Kabeer.
> >
> > On Oct 9 2019, at 12:24 am, nishith agarwal <n3...@gmail.com> wrote:
> > > Hmm, AVRO is case-sensitive but I've not had issues reading fields from
> > > GenericRecords with lower or upper so I'm not 100% confident on what the
> > > resolution for a lower vs upper case is. Have you tried using the
> > > partitionpath field names in upper case (in case your schema field is also
> > > upper case) ?
> > >
> > > -Nishith
> > > On Tue, Oct 8, 2019 at 4:00 PM Qian Wang <qw...@gmail.com> wrote:
> > > > Hi Nishith,
> > > > I have checked the data, there is no null in that field. Does there has
> > > > other possibility about this error?
> > > >
> > > > Thanks,
> > > > Qian

Re: Questions about using Hudi

Posted by Qian Wang <qw...@gmail.com>.
Hi Kabeer,

Thanks for your detailed explanation. I will try it again and update you with the result.

Best,
Qian
On Oct 11, 2019, 1:49 PM -0700, Kabeer Ahmed <ka...@linuxmail.org>, wrote:
> Hi Qian,
>
> If there are no nulls in the data, then most likey it is issue with the data types being stored. I have seen this issue again and again and in the recent one it was due to me storing double value when I had actually declared the schema as IntegerType. I can reproduce this with an example to prove the point. But I think you should look into your data.
> If possible I would recommend you run something like: https://stackoverflow.com/questions/33270907/how-to-validate-contents-of-spark-dataframe (https://link.getmailspring.com/link/1A222369-02FF-464B-9E5E-48022A443BEA@getmailspring.com/0?redirect=https%3A%2F%2Fstackoverflow.com%2Fquestions%2F33270907%2Fhow-to-validate-contents-of-spark-dataframe&recipient=ZGV2QGh1ZGkuYXBhY2hlLm9yZw%3D%3D). This will show you if there is any value in any column that is against the declared schema type. And when you fix that, the errors will go away.
> Keep us posted on how you get along with this.
> Thanks
> Kabeer.
>
> On Oct 9 2019, at 12:24 am, nishith agarwal <n3...@gmail.com> wrote:
> > Hmm, AVRO is case-sensitive but I've not had issues reading fields from
> > GenericRecords with lower or upper so I'm not 100% confident on what the
> > resolution for a lower vs upper case is. Have you tried using the
> > partitionpath field names in upper case (in case your schema field is also
> > upper case) ?
> >
> > -Nishith
> > On Tue, Oct 8, 2019 at 4:00 PM Qian Wang <qw...@gmail.com> wrote:
> > > Hi Nishith,
> > > I have checked the data, there is no null in that field. Does there has
> > > other possibility about this error?
> > >
> > > Thanks,
> > > Qian

Re: Questions about using Hudi

Posted by Kabeer Ahmed <ka...@linuxmail.org>.
Hi Qian,

If there are no nulls in the data, then most likely it is an issue with the data types being stored. I have seen this issue repeatedly; most recently it was caused by storing a double value where I had declared the schema type as IntegerType. I can reproduce this with an example to prove the point, but I think you should look into your data.
If possible, I would recommend running something like: https://stackoverflow.com/questions/33270907/how-to-validate-contents-of-spark-dataframe. This will show you whether any value in any column violates the declared schema type. Once you fix that, the errors will go away.
Keep us posted on how you get along with this.
Thanks
Kabeer.
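
A quick way to compare the physical Parquet column types against the declared Avro schema, using only the spark-sql shell (the path is a placeholder):

  spark-sql -e "CREATE TEMPORARY VIEW src USING parquet OPTIONS (path '/path/to/source'); DESCRIBE src"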

On Oct 9 2019, at 12:24 am, nishith agarwal <n3...@gmail.com> wrote:
> Hmm, AVRO is case-sensitive but I've not had issues reading fields from
> GenericRecords with lower or upper so I'm not 100% confident on what the
> resolution for a lower vs upper case is. Have you tried using the
> partitionpath field names in upper case (in case your schema field is also
> upper case) ?
>
> -Nishith
> On Tue, Oct 8, 2019 at 4:00 PM Qian Wang <qw...@gmail.com> wrote:
> > Hi Nishith,
> > I have checked the data, there is no null in that field. Does there has
> > other possibility about this error?
> >
> > Thanks,
> > Qian

Re: Questions about using Hudi

Posted by nishith agarwal <n3...@gmail.com>.
Hmm, Avro is case-sensitive, but I've not had issues reading fields from
GenericRecords in either lower or upper case, so I'm not 100% sure how a
lower vs. upper case mismatch is resolved. Have you tried using the
partition path field name in upper case (since your schema field is also
upper case)?

-Nishith
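
Concretely, that would just be the earlier import command re-run with the partition path field matching the schema's casing, everything else unchanged:

  hdfsparquetimport --upsert false --srcPath /path/to/source --targetPath /path/to/target --tableName xxx --tableType COPY_ON_WRITE --rowKeyField _row_key --partitionPathField SESSION_DATE --parallelism 1500 --schemaFilePath /path/to/avro/schema --format parquet --sparkMemory 6g --retry 2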

On Tue, Oct 8, 2019 at 4:00 PM Qian Wang <qw...@gmail.com> wrote:

> Hi Nishith,
>
> I have checked the data, there is no null in that field. Does there has
> other possibility about this error?
>
> Thanks,
> Qian

Re: Questions about using Hudi

Posted by Qian Wang <qw...@gmail.com>.
Hi Nishith,

I have checked the data, and there are no nulls in that field. Is there another possible cause for this error?

Thanks,
Qian
On Oct 8, 2019, 10:55 AM -0700, Qian Wang <qw...@gmail.com>, wrote:
> Hi Nishith,
>
> Thanks for your response.
> The session_date is one field in my original dataset. I have some questions about the schema parameter:
>
> 1. Do I need create the target table?
> 2. My source data is Parquet format, why the tool need the schema file as the parameter?
> 3. Can I use the schema file of Avro format?
>
> The schema is looks like:
>
> {"type":"record","name":"PathExtractData","doc":"Path event extract fact data”,”fields”:[
>     {“name”:”SESSION_DATE”,”type”:”string”},
>     {“name”:”SITE_ID”,”type”:”int”},
>     {“name”:”GUID”,”type”:”string”},
>     {“name”:”SESSION_KEY”,”type”:”long”},
>     {“name”:”USER_ID”,”type”:”string”},
>     {“name”:”STEP”,”type”:”int”},
>     {“name”:”PAGE_ID”,”type”:”int”}
> ]}
>
> Thanks.
>
> Best,
> Qian
> On Oct 8, 2019, 10:47 AM -0700, nishith agarwal <n3...@gmail.com>, wrote:
> > Qian,
> >
> > It looks like the partitionPathField that you specified (session_date) is
> > missing or the code is unable to grab it from your payload. Is this field a
> > top-level field or a nested field in your schema ?
> > ( Currently, the HDFSImporterTool looks for your partitionPathField only at
> > the top-level, for example genericRecord.get("session_date") )
> >
> > Thanks,
> > Nishith
> >
> >
> > On Tue, Oct 8, 2019 at 10:12 AM Qian Wang <qw...@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > Thanks for your response.
> > >
> > > Now I tried to convert existing dataset to Hudi managed dataset and I used
> > > the hdfsparquestimport in hud-cli. I encountered following error:
> > >
> > > 19/10/08 09:50:59 INFO DAGScheduler: Job 1 failed: countByKey at
> > > HoodieBloomIndex.java:148, took 2.913761 s
> > > 19/10/08 09:50:59 ERROR HDFSParquetImporter: Error occurred.
> > > org.apache.hudi.exception.HoodieUpsertException: Failed to upsert for
> > > commit time 20191008095056
> > >
> > > Caused by: org.apache.hudi.exception.HoodieIOException: partition key is
> > > missing. :session_date
> > >
> > > My command in hud-cli as following:
> > > hdfsparquetimport --upsert false --srcPath /path/to/source --targetPath
> > > /path/to/target --tableName xxx --tableType COPY_ON_WRITE --rowKeyField
> > > _row_key --partitionPathField session_date --parallelism 1500
> > > --schemaFilePath /path/to/avro/schema --format parquet --sparkMemory 6g
> > > --retry 2
> > >
> > > Could you please tell me how to solve this problem? Thanks.
> > >
> > > Best,
> > > Qian
> > > On Oct 6, 2019, 9:15 AM -0700, Qian Wang <qw...@gmail.com>, wrote:
> > > > Hi,
> > > >
> > > > I have some questions when I try to use Hudi in my company’s prod env:
> > > >
> > > > 1. When I migrate the history table in HDFS, I tried use hudi-cli and
> > > > HDFSParquetImporter tool. How can I specify Spark parameters in this tool,
> > > > such as Yarn queue, etc?
> > > > 2. Hudi needs to write metadata to Hive and it uses HiveMetastoreClient
> > > > and HiveJDBC. How can I do if the Hive has Kerberos Authentication?
> > > >
> > > > Thanks.
> > > >
> > > > Best,
> > > > Qian
> > >

Re: Questions about using Hudi

Posted by nishith agarwal <n3...@gmail.com>.
Qian,

(1) -> The target table (Hudi table) will be created automatically by the
HDFSImporter tool; you don't need to create it manually.
(2) -> Hudi ingests data based on the Avro schema provided by clients.
Since the importer tool goes through the same code paths, we require an
Avro schema (the latest one) to be passed in. Ideally, we could derive the
schema from the Parquet files, but since schemas go through an evolution
process, different Parquet files may have different schemas.
(3) -> Yes, an Avro schema is simply JSON, so as long as you put it in the
schema file, it should work.

I see that the SESSION_DATE field is present in your schema. What can
happen is that some records might not have this field populated; when that
happens, we cannot assign a partition path to the record, which results in
the exception you see. Are you sure that all your existing records in
Parquet have this field populated (i.e. NOT null)?

Thanks,
Nishith

On Tue, Oct 8, 2019 at 10:55 AM Qian Wang <qw...@gmail.com> wrote:

> Hi Nishith,
>
> Thanks for your response.
> The session_date is one field in my original dataset. I have some
> questions about the schema parameter:
>
> 1. Do I need create the target table?
> 2. My source data is Parquet format, why the tool need the schema file as
> the parameter?
> 3. Can I use the schema file of Avro format?
>
> The schema is looks like:
>
> {"type":"record","name":"PathExtractData","doc":"Path event extract fact
> data","fields":[
>     {"name":"SESSION_DATE","type":"string"},
>     {"name":"SITE_ID","type":"int"},
>     {"name":"GUID","type":"string"},
>     {"name":"SESSION_KEY","type":"long"},
>     {"name":"USER_ID","type":"string"},
>     {"name":"STEP","type":"int"},
>     {"name":"PAGE_ID","type":"int"}
> ]}
>
> Thanks.
>
> Best,
> Qian
> On Oct 8, 2019, 10:47 AM -0700, nishith agarwal <n3...@gmail.com>,
> wrote:
> > Qian,
> >
> > It looks like the partitionPathField that you specified (session_date) is
> > missing or the code is unable to grab it from your payload. Is this
> field a
> > top-level field or a nested field in your schema ?
> > ( Currently, the HDFSImporterTool looks for your partitionPathField only
> at
> > the top-level, for example genericRecord.get("session_date") )
> >
> > Thanks,
> > Nishith
> >
> >
> > On Tue, Oct 8, 2019 at 10:12 AM Qian Wang <qw...@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > Thanks for your response.
> > >
> > > Now I tried to convert existing dataset to Hudi managed dataset and I
> used
> > > the hdfsparquestimport in hud-cli. I encountered following error:
> > >
> > > 19/10/08 09:50:59 INFO DAGScheduler: Job 1 failed: countByKey at
> > > HoodieBloomIndex.java:148, took 2.913761 s
> > > 19/10/08 09:50:59 ERROR HDFSParquetImporter: Error occurred.
> > > org.apache.hudi.exception.HoodieUpsertException: Failed to upsert for
> > > commit time 20191008095056
> > >
> > > Caused by: org.apache.hudi.exception.HoodieIOException: partition key
> is
> > > missing. :session_date
> > >
> > > My command in hud-cli as following:
> > > hdfsparquetimport --upsert false --srcPath /path/to/source --targetPath
> > > /path/to/target --tableName xxx --tableType COPY_ON_WRITE --rowKeyField
> > > _row_key --partitionPathField session_date --parallelism 1500
> > > --schemaFilePath /path/to/avro/schema --format parquet --sparkMemory 6g
> > > --retry 2
> > >
> > > Could you please tell me how to solve this problem? Thanks.
> > >
> > > Best,
> > > Qian
> > > On Oct 6, 2019, 9:15 AM -0700, Qian Wang <qw...@gmail.com>, wrote:
> > > > Hi,
> > > >
> > > > I have some questions when I try to use Hudi in my company’s prod
> env:
> > > >
> > > > 1. When I migrate the history table in HDFS, I tried use hudi-cli and
> > > HDFSParquetImporter tool. How can I specify Spark parameters in this
> tool,
> > > such as Yarn queue, etc?
> > > > 2. Hudi needs to write metadata to Hive and it uses
> HiveMetastoreClient
> > > and HiveJDBC. How can I do if the Hive has Kerberos Authentication?
> > > >
> > > > Thanks.
> > > >
> > > > Best,
> > > > Qian
> > >
>

Re: Questions about using Hudi

Posted by Qian Wang <qw...@gmail.com>.
Hi Nishith,

Thanks for your response.
The session_date field is present in my original dataset. I have some questions about the schema parameter:

1. Do I need to create the target table?
2. My source data is in Parquet format; why does the tool need a schema file as a parameter?
3. Can I use a schema file in Avro format?

The schema looks like:

{"type":"record","name":"PathExtractData","doc":"Path event extract fact data","fields":[
    {"name":"SESSION_DATE","type":"string"},
    {"name":"SITE_ID","type":"int"},
    {"name":"GUID","type":"string"},
    {"name":"SESSION_KEY","type":"long"},
    {"name":"USER_ID","type":"string"},
    {"name":"STEP","type":"int"},
    {"name":"PAGE_ID","type":"int"}
]}

Thanks.

Best,
Qian
On Oct 8, 2019, 10:47 AM -0700, nishith agarwal <n3...@gmail.com>, wrote:
> Qian,
>
> It looks like the partitionPathField that you specified (session_date) is
> missing or the code is unable to grab it from your payload. Is this field a
> top-level field or a nested field in your schema ?
> ( Currently, the HDFSImporterTool looks for your partitionPathField only at
> the top-level, for example genericRecord.get("session_date") )
>
> Thanks,
> Nishith
>
>
> On Tue, Oct 8, 2019 at 10:12 AM Qian Wang <qw...@gmail.com> wrote:
>
> > Hi,
> >
> > Thanks for your response.
> >
> > Now I tried to convert existing dataset to Hudi managed dataset and I used
> > the hdfsparquestimport in hud-cli. I encountered following error:
> >
> > 19/10/08 09:50:59 INFO DAGScheduler: Job 1 failed: countByKey at
> > HoodieBloomIndex.java:148, took 2.913761 s
> > 19/10/08 09:50:59 ERROR HDFSParquetImporter: Error occurred.
> > org.apache.hudi.exception.HoodieUpsertException: Failed to upsert for
> > commit time 20191008095056
> >
> > Caused by: org.apache.hudi.exception.HoodieIOException: partition key is
> > missing. :session_date
> >
> > My command in hud-cli as following:
> > hdfsparquetimport --upsert false --srcPath /path/to/source --targetPath
> > /path/to/target --tableName xxx --tableType COPY_ON_WRITE --rowKeyField
> > _row_key --partitionPathField session_date --parallelism 1500
> > --schemaFilePath /path/to/avro/schema --format parquet --sparkMemory 6g
> > --retry 2
> >
> > Could you please tell me how to solve this problem? Thanks.
> >
> > Best,
> > Qian
> > On Oct 6, 2019, 9:15 AM -0700, Qian Wang <qw...@gmail.com>, wrote:
> > > Hi,
> > >
> > > I have some questions when I try to use Hudi in my company’s prod env:
> > >
> > > 1. When I migrate the history table in HDFS, I tried use hudi-cli and
> > HDFSParquetImporter tool. How can I specify Spark parameters in this tool,
> > such as Yarn queue, etc?
> > > 2. Hudi needs to write metadata to Hive and it uses HiveMetastoreClient
> > and HiveJDBC. How can I do if the Hive has Kerberos Authentication?
> > >
> > > Thanks.
> > >
> > > Best,
> > > Qian
> >

Re: Questions about using Hudi

Posted by nishith agarwal <n3...@gmail.com>.
Qian,

It looks like the partitionPathField you specified (session_date) is
missing, or the code is unable to grab it from your payload. Is this field
a top-level field or a nested field in your schema?
(Currently, the HDFSImporterTool looks for your partitionPathField only at
the top level, for example genericRecord.get("session_date").)
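
For illustration, here is a small self-contained Avro sketch of that
top-level lookup (the schema below is a trimmed, hypothetical example, not
your real one). Note that the lookup is case-sensitive, which may be worth
double-checking if your schema declares the field in upper case:

import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericRecord}

// Hypothetical two-field schema, just to show how the lookup behaves.
val schemaJson =
  """{"type":"record","name":"Example","fields":[
    |  {"name":"SESSION_DATE","type":"string"},
    |  {"name":"SITE_ID","type":"int"}
    |]}""".stripMargin
val schema = new Schema.Parser().parse(schemaJson)

val record: GenericRecord = new GenericData.Record(schema)
record.put("SESSION_DATE", "2019-10-08")
record.put("SITE_ID", 1)

// Only the exact top-level field name resolves; anything else returns
// null, which would then look like a missing partition key.
println(record.get("SESSION_DATE")) // 2019-10-08
println(record.get("session_date")) // null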

Thanks,
Nishith


On Tue, Oct 8, 2019 at 10:12 AM Qian Wang <qw...@gmail.com> wrote:

> Hi,
>
> Thanks for your response.
>
> Now I tried to convert existing dataset to Hudi managed dataset and I used
> the hdfsparquestimport in hud-cli. I encountered following error:
>
> 19/10/08 09:50:59 INFO DAGScheduler: Job 1 failed: countByKey at
> HoodieBloomIndex.java:148, took 2.913761 s
> 19/10/08 09:50:59 ERROR HDFSParquetImporter: Error occurred.
> org.apache.hudi.exception.HoodieUpsertException: Failed to upsert for
> commit time 20191008095056
>
> Caused by: org.apache.hudi.exception.HoodieIOException: partition key is
> missing. :session_date
>
> My command in hud-cli as following:
> hdfsparquetimport --upsert false --srcPath /path/to/source --targetPath
> /path/to/target --tableName xxx --tableType COPY_ON_WRITE --rowKeyField
> _row_key --partitionPathField session_date --parallelism 1500
> --schemaFilePath /path/to/avro/schema --format parquet --sparkMemory 6g
> --retry 2
>
> Could you please tell me how to solve this problem? Thanks.
>
> Best,
> Qian
> On Oct 6, 2019, 9:15 AM -0700, Qian Wang <qw...@gmail.com>, wrote:
> > Hi,
> >
> > I have some questions when I try to use Hudi in my company’s prod env:
> >
> > 1. When I migrate the history table in HDFS, I tried use hudi-cli and
> HDFSParquetImporter tool. How can I specify Spark parameters in this tool,
> such as Yarn queue, etc?
> > 2. Hudi needs to write metadata to Hive and it uses HiveMetastoreClient
> and HiveJDBC. How can I do if the Hive has Kerberos Authentication?
> >
> > Thanks.
> >
> > Best,
> > Qian
>

Re: Questions about using Hudi

Posted by Qian Wang <qw...@gmail.com>.
Hi,

Thanks for your response.

Now I am trying to convert an existing dataset to a Hudi-managed dataset, and I used the hdfsparquetimport command in hudi-cli. I encountered the following error:

19/10/08 09:50:59 INFO DAGScheduler: Job 1 failed: countByKey at HoodieBloomIndex.java:148, took 2.913761 s
19/10/08 09:50:59 ERROR HDFSParquetImporter: Error occurred.
org.apache.hudi.exception.HoodieUpsertException: Failed to upsert for commit time 20191008095056

Caused by: org.apache.hudi.exception.HoodieIOException: partition key is missing. :session_date

My command in hudi-cli is as follows:
hdfsparquetimport --upsert false --srcPath /path/to/source --targetPath /path/to/target --tableName xxx --tableType COPY_ON_WRITE --rowKeyField _row_key --partitionPathField session_date --parallelism 1500 --schemaFilePath /path/to/avro/schema --format parquet --sparkMemory 6g --retry 2

Could you please tell me how to solve this problem? Thanks.

Best,
Qian
On Oct 6, 2019, 9:15 AM -0700, Qian Wang <qw...@gmail.com>, wrote:
> Hi,
>
> I have some questions when I try to use Hudi in my company’s prod env:
>
> 1. When I migrate the history table in HDFS, I tried use hudi-cli and HDFSParquetImporter tool. How can I specify Spark parameters in this tool, such as Yarn queue, etc?
> 2. Hudi needs to write metadata to Hive and it uses HiveMetastoreClient and HiveJDBC. How can I do if the Hive has Kerberos Authentication?
>
> Thanks.
>
> Best,
> Qian

Re: Questions about using Hudi

Posted by nishith agarwal <n3...@gmail.com>.
Hi Qian,

Thanks for your questions.

For (1) -> The Spark properties are currently picked up from
SPARK_CONF_DIR, so if you define these configs in the spark-defaults.conf
file, HDFSParquetImporter will pick them up from there.
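
For example, a spark-defaults.conf under SPARK_CONF_DIR could carry the
Yarn queue and sizing settings (the queue name and values below are just
placeholders):

spark.master              yarn
spark.yarn.queue          your_yarn_queue
spark.executor.memory     6g
spark.executor.cores      4
spark.executor.instances  50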

For (2) -> The HiveSyncTool now has a useJDBC flag that lets you use either
JDBC or the HiveMetastoreClient. If you provide the right URL to connect to
the metastore (essentially, provide the Hive Kerberos principal) and set
useJDBC to false, you will be able to talk to the Hive metastore via the
HiveMetastoreClient.
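
As a rough sketch (these are standard Hive settings; the host names, realm
and principal below are placeholders), a Kerberized metastore connection is
typically driven by properties like:

hive.metastore.uris                 thrift://metastore-host:9083
hive.metastore.sasl.enabled         true
hive.metastore.kerberos.principal   hive/_HOST@EXAMPLE.COM

whereas the JDBC path would use a URL of the form
jdbc:hive2://hiveserver2-host:10000/default;principal=hive/_HOST@EXAMPLE.COM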

Thanks,
Nishith

On Sun, Oct 6, 2019 at 9:15 AM Qian Wang <qw...@gmail.com> wrote:

> Hi,
>
> I have some questions when I try to use Hudi in my company’s prod env:
>
> 1. When I migrate the history table in HDFS, I tried use hudi-cli and
> HDFSParquetImporter tool. How can I specify Spark parameters in this tool,
> such as Yarn queue, etc?
> 2. Hudi needs to write metadata to Hive and it uses HiveMetastoreClient
> and HiveJDBC. How can I do if the Hive has Kerberos Authentication?
>
> Thanks.
>
> Best,
> Qian
>