Posted to dev@hudi.apache.org by SATISH SIDNAKOPPA <sa...@gmail.com> on 2019/04/29 14:19:48 UTC

multi-partitioned hudi table | partitions not created

Hi Team,


I have to store data by department and region.
/dept=HR/region=AP
/dept=OPS/region=AP
/dept=HR/region=SA
/dept=OPS/region=SA

so the partitioned table will have multiple partition keys.


I tried passing the value as comma-separated (dept,region):
.option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY,"dept,region")

and as dot-separated:
.option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY,"dept.region")

but the partitions were not created in HDFS. All the data was added to the
default partition.


Could you guide me on the format for passing multiple partition fields when
writing a Hudi dataset with Spark?

regards
Satish S

Re: multi-partitioned hudi table | partitions not created

Posted by Vinoth Chandar <vi...@apache.org>.
I recommend using the HiveSync tool to manage the registration rather than
doing it manually.
Otherwise, what you see is expected behavior: part1 and part2 will be on the
file if they were on the data frame.


Re: multi-partitioned hudi table | partitions not created

Posted by SATISH SIDNAKOPPA <sa...@gmail.com>.
files in HDFS:


/apps/hive/warehouse/emp_multi_partkey/part1=A/part2=2018

manually created table:
CREATE EXTERNAL TABLE `emp_multi_partkey`(
  `_hoodie_commit_time` string,
  `_hoodie_commit_seqno` string,
  `_hoodie_record_key` string,
  `_hoodie_partition_path` string,
  `_hoodie_file_name` string,
  `emp_id` string,
  `part_col` string)
PARTITIONED BY (
  `part1` string,
  `part2` string)

these two columns exist in the dataset too:

concat('part1=',part1,'/part2=',part2) as part_col
where part1=A and part2=2018
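
The Hive expression above just glues field names and values into a
slash-separated path. A minimal sketch of that construction (plain Python,
illustrative names only, not Hudi's actual implementation):

```python
def part_col(part1: str, part2: str) -> str:
    """Mirror of the Hive expression concat('part1=', part1, '/part2=', part2)."""
    return f"part1={part1}/part2={part2}"

print(part_col("A", "2018"))  # part1=A/part2=2018
```

The resulting string is what lands in the `_hoodie_partition_path` column and
as the directory path under the table base path.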

I am able to update and delete records. Will there be any gaps if this
process is followed?


Re: multi-partitioned hudi table | partitions not created

Posted by SATISH SIDNAKOPPA <sa...@gmail.com>.
Hi Vinoth,

I created the multi-partition path as below.

in the dataset ---> concat('part1=',SUBSTR(emp_name,1,1),'/part2=','2018') as
part_col
in the spark.write Hudi options --->
.option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY,"part_col")

files in HDFS:


alter table hudi.emp_multi_partkey add partition(part1='A', part2='2018');
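
For reference, the manual registration statement can be generated from the
same partition values. This is a hypothetical helper for illustration, not
part of Hudi or its HiveSync tool:

```python
def add_partition_ddl(table: str, parts: list) -> str:
    """Build a Hive ADD PARTITION statement from ordered (column, value) pairs."""
    spec = ",".join(f"{col}='{val}'" for col, val in parts)
    return f"ALTER TABLE {table} ADD IF NOT EXISTS PARTITION({spec})"

ddl = add_partition_ddl("hudi.emp_multi_partkey", [("part1", "A"), ("part2", "2018")])
print(ddl)
```

Each new partition path written by Hudi would need a matching statement like
this, which is why syncing automatically is preferable to registering by hand.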





Re: multi-partitioned hudi table | partitions not created

Posted by Vinoth Chandar <vi...@apache.org>.
Hi Satish,

That's because the default KeyGenerator class only reads a single field
to partition on. What you are expecting is a composite key.
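
To illustrate what such a composite key generator has to produce, here is a
hedged sketch in plain Python. The class and method names are hypothetical;
the real pluggable KeyGenerator lives in Hudi's Java code:

```python
class CompositeKeySketch:
    """Toy model of a composite-field key generator: one record-key field plus
    an ordered list of partition fields joined into a Hive-style nested path."""

    def __init__(self, record_key_field, partition_fields):
        self.record_key_field = record_key_field
        self.partition_fields = partition_fields

    def get_key(self, record):
        # The record key comes from a single field; the partition path is
        # built by joining every configured field as name=value segments.
        record_key = record[self.record_key_field]
        partition_path = "/".join(f"{f}={record[f]}" for f in self.partition_fields)
        return record_key, partition_path
```

For example, `CompositeKeySketch("emp_id", ["dept", "region"])` applied to a
record with dept=HR and region=AP yields the partition path "dept=HR/region=AP",
matching the layout asked about at the start of the thread.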

Nishith has one in the test suite PR
https://github.com/apache/incubator-hudi/pull/623/files#diff-8814d5eb596f19bc9a87e419453fd7c8

We plan to add this to the main code. For now, you can copy the class and
see if it solves your need? The KeyGenerator is pluggable anyway.

Thanks
Vinoth
