Posted to dev@hudi.apache.org by Purushotham Pushpavanthar <pu...@gmail.com> on 2020/01/17 11:43:42 UTC

updatePartitionsToTable() is time consuming and redundant.

Hi,

I noticed that
*org.apache.hudi.hive.HoodieHiveClient#updatePartitionsToTable()* is time
consuming when Hudi runs on a set of records that spans a large number of
partitions. All it does is set the location for each updated partition
path, whereas
*org.apache.hudi.hive.HoodieHiveClient#addPartitionsToTable()* already
takes care of adding new partitions to the table.

   1. For a given table whose base path doesn't change (it usually doesn't
   in production), why is *updatePartitionsToTable()* needed? Can you
   please shed some light on any case where it is needed?
   2. If it is required, can we do something to optimise the time consumed
   by this operation? Currently, the *ALTER statements* are executed one by
   one, one per (partition, path) pair, for every updated partition.



Regards,
Purushotham Pushpavanth
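
To make the cost described above concrete, here is a minimal sketch of issuing one statement per (partition, path) pair. It assumes the statements take the standard Hive form ALTER TABLE ... PARTITION (...) SET LOCATION ...; the class and method names are hypothetical and are not taken from HoodieHiveClient, and the table name, partition key and paths are borrowed from the logs in this thread purely for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Hudi: builds one ALTER statement per partition.
final class AlterPartitionSketch {

    // One statement per (partition value, location) pair, so the number of
    // round trips to Hive grows linearly with the number of updated partitions.
    static List<String> buildSetLocationStatements(
            String table, String partitionKey, Map<String, String> partitionToPath) {
        List<String> statements = new ArrayList<>();
        for (Map.Entry<String, String> entry : partitionToPath.entrySet()) {
            statements.add("ALTER TABLE `" + table + "` PARTITION (`" + partitionKey
                + "`='" + entry.getKey() + "') SET LOCATION '" + entry.getValue() + "'");
        }
        return statements;
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order so the output is deterministic.
        Map<String, String> updated = new LinkedHashMap<>();
        updated.put("20191117", "/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191117");
        updated.put("20191120", "/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191120");
        for (String sql : buildSetLocationStatements("sales_order_item", "dt", updated)) {
            System.out.println(sql);
        }
    }
}
```

If every updated partition is flagged for a location change, a sync touching N partitions issues N such statements even when no location actually moved.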

Re: updatePartitionsToTable() is time consuming and redundant.

Posted by "vbalaji@apache.org" <vb...@apache.org>.
 
Resurrecting this old thread and adding Udit.
Udit,
I am not able to reproduce this issue with HDFS. Are you seeing this pattern of redundant alter-partition calls?
Although not related, I was looking into https://jira.apache.org/jira/browse/HUDI-325 and am wondering whether we are seeing any discrepancies in hive-syncing between HDFS and non-HDFS clusters.
Balaji.V
    On Wednesday, February 19, 2020, 11:36:08 AM PST, vbalaji@apache.org <vb...@apache.org> wrote:  
 
 
Hi Pratyaksh/Purushotham,
I spent some time this morning trying to reproduce this locally but was unable to. There is a unit test, TestHiveSyncTool.testSyncIncremental, which is quite close to the setup we need for a repro.
I added the check below and it passed (meaning it works as expected, with no unnecessary update-partition calls). Can you use the code below to try reproducing it locally and in your real environment to see what is happening?
Balaji.V
```
System.out.println("DUPLICATE CHECK");
String commitTime3 = "102";
TestUtil.addCOWPartitions(1, true, dateTime, commitTime3);
hiveClient = new HoodieHiveClient(TestUtil.hiveSyncConfig, TestUtil.getHiveConf(), TestUtil.fileSystem);
writtenPartitionsSince = hiveClient.getPartitionsWrittenToSince(Option.of(commitTime2));
System.out.println("Added Partitions :" + writtenPartitionsSince);
assertEquals(1, writtenPartitionsSince.size());
hivePartitions = hiveClient.scanTablePartitions(TestUtil.hiveSyncConfig.tableName);
partitionEvents = hiveClient.getPartitionEvents(hivePartitions, writtenPartitionsSince);
// The duplicate check itself: no partition events are expected here
assertEquals("No partition events", 0, partitionEvents.size());

tool = new HiveSyncTool(TestUtil.hiveSyncConfig, TestUtil.getHiveConf(), TestUtil.fileSystem);
tool.syncHoodieTable();
// Sync should add the one partition
assertEquals(6, hiveClient.scanTablePartitions(TestUtil.hiveSyncConfig.tableName).size());
assertEquals("The last commit that was synced should be 102", commitTime3,
    hiveClient.getLastCommitTimeSynced(TestUtil.hiveSyncConfig.tableName).get());
```
    On Wednesday, February 19, 2020, 04:08:39 AM PST, Pratyaksh Sharma <pr...@gmail.com> wrote:  
 
 Hi Balaji,

We are using Hadoop 3.1.0.

Here is the output of the function you wanted to see -

Path is : /data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191117

Is Absolute :true
Stripped Path =/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191117

Stripped path does not contain scheme and authority.

On Mon, Feb 17, 2020 at 2:46 AM Balaji Varadarajan
<v....@ymail.com.invalid> wrote:

>
> Sorry for the delay. From the logs, it is clear that the stored partition
> key and the lookup key are not exactly the same. One has a scheme and
> authority in its URI while the other does not. This is the reason why we
> are updating the same partition again.
> Some of the methods used here come from hadoop-common and related
> packages. With Hadoop 2.7.3, I am NOT able to reproduce this issue locally.
> I used the code below to try to repro. Which version of Hadoop are you
> using at runtime? Can you check whether the stripped path (see test code
> below) still contains the scheme and authority?
>
> ```
> public void testit() {
>    Path path = new Path("s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt"
>        + "=20191117\n");
>    System.out.println("Path is : " + path.toUri().getPath());
>    System.out.println("Is Absolute :" + path.isUriPathAbsolute());
>    String stripped = Path.getPathWithoutSchemeAndAuthority(path).toUri().getPath();
>    System.out.println("Stripped Path =" + stripped);
> }
> ```
> Balaji.V
>
>
>    On Wednesday, February 5, 2020, 12:53:57 AM PST, Purushotham
> Pushpavanthar <pu...@gmail.com> wrote:
>
>  Hi Balaji/Vinoth,
>
> Below is the log we obtained from Hudi.
>
> 20/01/22 10:30:03 INFO HiveSyncTool: Last commit time synced was found to be 20200122094611
> 20/01/22 10:30:03 INFO HoodieHiveClient: Last commit time synced is 20200122094611, Getting commits since then
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20180108, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180108, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180108
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20180221, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180221, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180221
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20180102, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180102, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180102
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20191007, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191007, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191007
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20191128, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191128, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191128
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20191127, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191127, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191127
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20191006, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191006, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191006
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20191009, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191009, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191009
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20191129, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191129, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191129
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20191008, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191008, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191008
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20191120, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191120, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191120
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20191122, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191122, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191122
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20191001, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191001, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191001
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20191121, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191121, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191121
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20191124, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191124, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191124
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20191003, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191003, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191003
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20191002, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191002, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191002
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20191123, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191123, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191123
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20191005, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191005, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191005
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20191126, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191126, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191126
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20191125, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191125, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191125
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20191004, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191004, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191004
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20181208, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181208, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181208
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20181207, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181207, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181207
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20181206, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181206, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181206
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20181205, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181205, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181205
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20180117, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180117, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180117
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20181209, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181209, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181209
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20181204, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181204, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181204
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20181203, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181203, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181203
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20181202, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181202, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181202
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20181201, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181201, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181201
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20191117, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191117, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191117
>
> Regards,
> Purushotham Pushpavanth
>
>
>
> On Tue, 4 Feb 2020 at 05:50, Vinoth Chandar <vi...@apache.org> wrote:
>
> > Unfortunately, it looks like the mailing list does not support
> > attachments :(
> > Could you paste it inline?
> >
> > On Sat, Feb 1, 2020 at 6:20 AM Purushotham Pushpavanthar <
> > pushpavanthar@gmail.com> wrote:
> >
> > > Hi Balaji,
> > >
> > > The attachment contains the logs you asked for.
> > > However, the only difference between storageValue and
> > > fullStoragePartitionPath is *target-base-path*.
> > > So, if I'm not wrong, the code marks every partition that received
> > > UPDATE data for a partition update, which is why it is time consuming.
> > >
> > > Regards,
> > > Purushotham Pushpavanth
> > >
> > >
> > >
> > > On Mon, 20 Jan 2020 at 08:58, Balaji Varadarajan
> > > <v....@ymail.com.invalid> wrote:
> > >
> > >>  Hi Purushotham,
> > > >> I am unable to reproduce the same partitions getting hive-synced
> > > >> locally. Can you add the following log message in
> > > >> HoodieHiveClient.java, run the code, and send us the logs?
> > > >> diff --git a/hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java b/hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java
> > > >> index 4578bb2f..ba4b1147 100644
> > > >> --- a/hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java
> > > >> +++ b/hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java
> > > >> @@ -237,6 +237,8 @@ public class HoodieHiveClient {
> > > >>          if (!paths.containsKey(storageValue)) {
> > > >>            events.add(PartitionEvent.newPartitionAddEvent(storagePartition));
> > > >>          } else if (!paths.get(storageValue).equals(fullStoragePartitionPath)) {
> > > >> +          LOG.info("Partition Location changes. StorageVal=" + storageValue
> > > >> +              + ", Existing Hive Path=" + paths.get(storageValue) + ", New Location=" + fullStoragePartitionPath);
> > > >>            events.add(PartitionEvent.newPartitionUpdateEvent(storagePartition));
> > > >>          }
> > > >>        }
> > > >>
> > >>
> > > >> Thanks, Balaji.V
> > >>    On Friday, January 17, 2020, 03:44:08 AM PST, Purushotham
> > >> Pushpavanthar <pu...@gmail.com> wrote:
> > >>
> > >>  Hi,
> > >>
> > > >> I noticed that
> > > >> *org.apache.hudi.hive.HoodieHiveClient#updatePartitionsToTable()* is
> > > >> time consuming when Hudi runs on a set of records that spans a large
> > > >> number of partitions. All it does is set the location for each
> > > >> updated partition path, whereas
> > > >> *org.apache.hudi.hive.HoodieHiveClient#addPartitionsToTable()* already
> > > >> takes care of adding new partitions to the table.
> > >>
> > > >>  1. For a given table whose base path doesn't change (it usually
> > > >>  doesn't in production), why is *updatePartitionsToTable()* needed?
> > > >>  Can you please shed some light on any case where it is needed?
> > > >>  2. If it is required, can we do something to optimise the time
> > > >>  consumed by this operation? Currently, the *ALTER statements* are
> > > >>  executed one by one, one per (partition, path) pair, for every
> > > >>  updated partition.
> > >>
> > >>
> > >>
> > >> Regards,
> > >> Purushotham Pushpavanth
> > >>
> > >
> > >
> >
>
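
The logs quoted above show the two sides of the comparison differing only by scheme and authority: Hive holds /data/warehouse/..., while the sync computes s3a://dataplatform.internal.warehouse/data/warehouse/.... A minimal sketch of comparing the two after normalisation follows. It uses java.net.URI rather than Hadoop's Path so that it runs standalone; the class and method names are hypothetical and this is not the code in HoodieHiveClient.

```java
import java.net.URI;

// Hypothetical helper, not part of Hudi: compares partition locations by path only.
final class PartitionPathCheck {

    // Keeps only the path component, dropping the scheme ("s3a") and the
    // authority (bucket or host). A scheme-less input is returned unchanged.
    static String stripSchemeAndAuthority(String location) {
        return URI.create(location).getPath();
    }

    // Two locations count as the same when their path components match.
    static boolean sameLocation(String hivePath, String storagePath) {
        return stripSchemeAndAuthority(hivePath)
            .equals(stripSchemeAndAuthority(storagePath));
    }

    public static void main(String[] args) {
        String hive = "/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191117";
        String storage = "s3a://dataplatform.internal.warehouse"
            + "/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191117";
        // A plain string comparison treats the partition as moved.
        System.out.println("Raw equals:        " + hive.equals(storage));
        // Comparing path components treats it as unchanged.
        System.out.println("Normalized equals: " + sameLocation(hive, storage));
    }
}
```

One trade-off of this approach: ignoring the authority also hides a genuine move between two buckets that share the same path, so whether it is acceptable depends on whether the base path can ever change.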
    

Re: updatePartitionsToTable() is time consuming and redundant.

Posted by "vbalaji@apache.org" <vb...@apache.org>.
 
Hi Pratyaksh/Purushotham,
I spent some time this morning trying to reproduce this locally but was unable to. There is a unit test, TestHiveSyncTool.testSyncIncremental, which is quite close to the setup we need for a repro.
I added the check below and it passed (meaning it works as expected, with no unnecessary update-partition calls). Can you use the code below to try reproducing it locally and in your real environment to see what is happening?
Balaji.V
```
System.out.println("DUPLICATE CHECK");
String commitTime3 = "102";
TestUtil.addCOWPartitions(1, true, dateTime, commitTime3);
hiveClient = new HoodieHiveClient(TestUtil.hiveSyncConfig, TestUtil.getHiveConf(), TestUtil.fileSystem);
writtenPartitionsSince = hiveClient.getPartitionsWrittenToSince(Option.of(commitTime2));
System.out.println("Added Partitions :" + writtenPartitionsSince);
assertEquals(1, writtenPartitionsSince.size());
hivePartitions = hiveClient.scanTablePartitions(TestUtil.hiveSyncConfig.tableName);
partitionEvents = hiveClient.getPartitionEvents(hivePartitions, writtenPartitionsSince);
// The duplicate check itself: no partition events are expected here
assertEquals("No partition events", 0, partitionEvents.size());

tool = new HiveSyncTool(TestUtil.hiveSyncConfig, TestUtil.getHiveConf(), TestUtil.fileSystem);
tool.syncHoodieTable();
// Sync should add the one partition
assertEquals(6, hiveClient.scanTablePartitions(TestUtil.hiveSyncConfig.tableName).size());
assertEquals("The last commit that was synced should be 102", commitTime3,
    hiveClient.getLastCommitTimeSynced(TestUtil.hiveSyncConfig.tableName).get());
```
    On Wednesday, February 19, 2020, 04:08:39 AM PST, Pratyaksh Sharma <pr...@gmail.com> wrote:  
 
 Hi Balaji,

We are using Hadoop 3.1.0.

Here is the output of the function you wanted to see -

Path is : /data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191117

Is Absolute :true
Stripped Path =/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191117

Stripped path does not contain scheme and authority.

On Mon, Feb 17, 2020 at 2:46 AM Balaji Varadarajan
<v....@ymail.com.invalid> wrote:

>
> Sorry for the delay. From the logs, it is clear that the stored partition
> key and the lookup key are not exactly the same. One has a scheme and
> authority in its URI while the other does not. This is the reason why we
> are updating the same partition again.
> Some of the methods used here come from hadoop-common and related
> packages. With Hadoop 2.7.3, I am NOT able to reproduce this issue locally.
> I used the code below to try to repro. Which version of Hadoop are you
> using at runtime? Can you check whether the stripped path (see test code
> below) still contains the scheme and authority?
>
> ```
> public void testit() {
>    Path path = new Path("s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt"
>        + "=20191117\n");
>    System.out.println("Path is : " + path.toUri().getPath());
>    System.out.println("Is Absolute :" + path.isUriPathAbsolute());
>    String stripped = Path.getPathWithoutSchemeAndAuthority(path).toUri().getPath();
>    System.out.println("Stripped Path =" + stripped);
> }
> ```
> Balaji.V
>
>
>    On Wednesday, February 5, 2020, 12:53:57 AM PST, Purushotham
> Pushpavanthar <pu...@gmail.com> wrote:
>
>  Hi Balaji/Vinoth,
>
> Below is the log we obtained from Hudi.
>
> 20/01/22 10:30:03 INFO HiveSyncTool: Last commit time synced was found to be 20200122094611
> 20/01/22 10:30:03 INFO HoodieHiveClient: Last commit time synced is 20200122094611, Getting commits since then
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20180108, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180108, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180108
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20180221, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180221, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180221
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20180102, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180102, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180102
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20191007, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191007, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191007
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20191128, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191128, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191128
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20191127, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191127, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191127
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20191006, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191006, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191006
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20191009, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191009, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191009
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20191129, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191129, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191129
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20191008, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191008, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191008
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20191120, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191120, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191120
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20191122, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191122, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191122
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20191001, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191001, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191001
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20191121, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191121, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191121
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20191124, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191124, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191124
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20191003, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191003, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191003
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20191002, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191002, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191002
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20191123, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191123, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191123
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20191005, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191005, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191005
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20191126, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191126, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191126
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20191125, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191125, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191125
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20191004, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191004, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191004
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20181208, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181208, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181208
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20181207, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181207, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181207
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20181206, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181206, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181206
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20181205, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181205, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181205
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20180117, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180117, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180117
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20181209, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181209, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181209
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20181204, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181204, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181204
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20181203, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181203, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181203
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20181202, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181202, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181202
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20181201, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181201, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181201
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20191117, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191117, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191117
>
> Regards,
> Purushotham Pushpavanth
>
>
>
> On Tue, 4 Feb 2020 at 05:50, Vinoth Chandar <vi...@apache.org> wrote:
>
> > Unfortunately, the mailing list does not support attachments, looks like
> :(
> > Could you paste it inline?
> >
> > On Sat, Feb 1, 2020 at 6:20 AM Purushotham Pushpavanthar <
> > pushpavanthar@gmail.com> wrote:
> >
> > > Hi Balaji,
> > >
> > > The attachment contains the logs you asked for.
> > > However, the only difference between storageValue and
> > > fullStoragePartitionPath is *target-base-path*.
> > > So if I'm not wrong, the code will be marking all partitions which got
> > > UPDATE data for partition update. Hence time consuming.
> > >
> > > Regards,
> > > Purushotham Pushpavanth
> > >
> > >
> > >
> > > On Mon, 20 Jan 2020 at 08:58, Balaji Varadarajan
> > > <v....@ymail.com.invalid> wrote:
> > >
> > >>  Hi Purushotham,
> > >> I am unable to reproduce same  partitions getting hive-synced locally.
> > >> Can you add the following log message in HoodieHiveClient.java and run
> > the
> > >> code and send us logs.
> > >> diff --git
> > >> a/hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java
> > >> b/hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java
> > >>
> > >> index 4578bb2f..ba4b1147 100644
> > >>
> > >> ---
> a/hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java
> > >>
> > >> +++
> b/hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java
> > >>
> > >> @@ -237,6 +237,8 @@ public class HoodieHiveClient {
> > >>
> > >>          if (!paths.containsKey(storageValue)) {
> > >>
> > >>
> > >> events.add(PartitionEvent.newPartitionAddEvent(storagePartition));
> > >>
> > >>          } else if
> > >> (!paths.get(storageValue).equals(fullStoragePartitionPath)) {
> > >>
> > >> +          LOG.info("Partition Location changes. StorageVal=" +
> > >> storageValue
> > >>
> > >> +              + ", Existing Hive Path=" + paths.get(storageValue) +
> ",
> > >> New Location=" + fullStoragePartitionPath);
> > >>
> > >>
> > >> events.add(PartitionEvent.newPartitionUpdateEvent(storagePartition));
> > >>
> > >>          }
> > >>
> > >>        }
> > >>
> > > >> Thanks, Balaji.V
> > >>    On Friday, January 17, 2020, 03:44:08 AM PST, Purushotham
> > >> Pushpavanthar <pu...@gmail.com> wrote:
> > >>
> > >>  Hi,
> > >>
> > >> I noticed that
> > >> *org.apache.hudi.hive.HoodieHiveClient#updatePartitionsToTable()* is
> > time
> > >> consuming while running HUDI on set of records which contains data for
> > >> large set of partitions. All it is doing is setting location for each
> > >> updated partition path. However,
> > >> *org.apache.hudi.hive.HoodieHiveClient#addPartitionsToTable()
> > >> *is taking care of adding new partitions to the table.
> > >>
> > >>  1. For a given table, whose base path doesn't change (usually it
> > doesn't
> > >>  in production), why *updatePartitionsToTable() *is needed? Can you
> > >>  please throw some light on any such case where this is needed?
> > >>  2. If it is required, can we do something to optimise the time
> > consumed
> > >>  by this operation? Currently, the *Alter Statements* are executed one
> > by
> > >>  one on each (partition, path) pair for every updated partition.
> > >>
> > >>
> > >>
> > >> Regards,
> > >> Purushotham Pushpavanth
> > >>
> > >
> > >
> >
>
  

Re: updatePartitionsToTable() is time consuming and redundant.

Posted by Pratyaksh Sharma <pr...@gmail.com>.
Hi Balaji,

We are using Hadoop 3.1.0.

Here is the output of the test function you asked us to run:

Path is : /data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191117

Is Absolute :true
Stripped Path =/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191117

The stripped path does not contain a scheme or authority.
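
Given that result, the comparison in the diff quoted below could be sketched roughly as follows if both sides were reduced to their path component first. This is only an illustration, not the Hudi implementation: the class `PartitionEventSketch`, the helper `pathOnly`, the simplified maps, and the partition value 20200101 are all hypothetical, and `java.net.URI` stands in for Hadoop's `Path`.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical, simplified model of the add/update decision from the quoted diff.
class PartitionEventSketch {
    enum EventType { ADD, UPDATE }

    // Keep only the path component, dropping any scheme and authority.
    static String pathOnly(String location) {
        return URI.create(location).getPath();
    }

    // hivePaths: partition value -> location recorded in the metastore.
    // storagePaths: partition value -> location derived from the table base path.
    static Map<String, EventType> classify(Map<String, String> hivePaths,
                                           Map<String, String> storagePaths) {
        Map<String, EventType> events = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : storagePaths.entrySet()) {
            String existing = hivePaths.get(entry.getKey());
            if (existing == null) {
                // Partition is unknown to the metastore: it has to be added.
                events.put(entry.getKey(), EventType.ADD);
            } else if (!pathOnly(existing).equals(pathOnly(entry.getValue()))) {
                // Path component really differs: the location has to be updated.
                events.put(entry.getKey(), EventType.UPDATE);
            }
            // Otherwise the partition is already in sync and no event is emitted.
        }
        return events;
    }

    public static void main(String[] args) {
        String base = "/data/warehouse/hudi/external/prod_db/sales_order_item";
        Map<String, String> hivePaths = new LinkedHashMap<>();
        hivePaths.put("20191117", base + "/dt=20191117");

        Map<String, String> storagePaths = new LinkedHashMap<>();
        storagePaths.put("20191117", "s3a://dataplatform.internal.warehouse" + base + "/dt=20191117");
        storagePaths.put("20200101", "s3a://dataplatform.internal.warehouse" + base + "/dt=20200101");

        // Only the partition missing from the metastore produces an event.
        System.out.println(classify(hivePaths, storagePaths)); // {20200101=ADD}
    }
}
```

With a plain string comparison, 20191117 would also be flagged as an update, which matches the "Partition Location changes" lines in the log.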

On Mon, Feb 17, 2020 at 2:46 AM Balaji Varadarajan
<v....@ymail.com.invalid> wrote:

>
> Sorry for the delay. From the logs, it is clear that the stored partition
> key and  lookup key are not exactly same. One has scheme and authority in
> its URI while the other is not. This is the reason why we are updating the
> same partition again.
> Some of the methods used here comes from hadoop-common and related
> packages. With Hadoop 2.7.3, I am NOT able to reproduce this issue locally.
> I used the below code to try to repro. Which version of Hadoop are you
> using in runtime. Can you  check if the stripped path (see test code below)
> still contains scheme and authority.
>
> ```public void testit() {
>     Path path = new
> Path("s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt"
>         + "=20191117\n");
>     System.out.println("Path is : " + path.toUri().getPath());
>     System.out.println("Is Absolute :" + path.isUriPathAbsolute());
>     String stripped =
> Path.getPathWithoutSchemeAndAuthority(path).toUri().getPath();
>     System.out.println("Stripped Path =" + stripped);
> }
> ```
> Balaji.V
>
>
>     On Wednesday, February 5, 2020, 12:53:57 AM PST, Purushotham
> Pushpavanthar <pu...@gmail.com> wrote:
>
>  Hi Balaji/Vinoth,
>
> Below is the log we obtained from Hudi.
>
> 20/01/22 10:30:03 INFO HiveSyncTool: Last commit time synced was found to
> be 20200122094611
> 20/01/22 10:30:03 INFO HoodieHiveClient: Last commit time synced is
> 20200122094611, Getting commits since then
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20180108, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180108,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180108
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20180221, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180221,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180221
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20180102, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180102,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180102
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191007, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191007,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191007
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191128, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191128,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191128
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191127, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191127,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191127
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191006, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191006,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191006
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191009, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191009,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191009
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191129, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191129,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191129
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191008, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191008,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191008
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191120, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191120,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191120
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191122, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191122,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191122
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191001, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191001,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191001
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191121, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191121,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191121
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191124, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191124,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191124
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191003, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191003,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191003
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191002, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191002,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191002
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191123, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191123,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191123
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191005, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191005,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191005
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191126, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191126,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191126
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191125, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191125,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191125
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191004, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191004,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191004
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20181208, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181208,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181208
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20181207, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181207,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181207
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20181206, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181206,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181206
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20181205, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181205,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181205
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20180117, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180117,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180117
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20181209, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181209,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181209
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20181204, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181204,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181204
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20181203, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181203,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181203
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20181202, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181202,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181202
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20181201, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181201,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181201
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191117, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191117,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191117
>
> Regards,
> Purushotham Pushpavanth
>
>
>
> On Tue, 4 Feb 2020 at 05:50, Vinoth Chandar <vi...@apache.org> wrote:
>
> > Unfortunately, the mailing list does not support attachments, looks like
> :(
> > Could you paste it inline?
> >
> > On Sat, Feb 1, 2020 at 6:20 AM Purushotham Pushpavanthar <
> > pushpavanthar@gmail.com> wrote:
> >
> > > Hi Balaji,
> > >
> > > The attachment contains the logs you asked for.
> > > However, the only difference between storageValue and
> > > fullStoragePartitionPath is *target-base-path*.
> > > So if I'm not wrong, the code will be marking all partitions which got
> > > UPDATE data for partition update. Hence time consuming.
> > >
> > > Regards,
> > > Purushotham Pushpavanth
> > >
> > >
> > >
> > > On Mon, 20 Jan 2020 at 08:58, Balaji Varadarajan
> > > <v....@ymail.com.invalid> wrote:
> > >
> > >>  Hi Purushotham,
> > >> I am unable to reproduce same  partitions getting hive-synced locally.
> > >> Can you add the following log message in HoodieHiveClient.java and run
> > the
> > >> code and send us logs.
> > >> diff --git
> > >> a/hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java
> > >> b/hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java
> > >>
> > >> index 4578bb2f..ba4b1147 100644
> > >>
> > >> ---
> a/hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java
> > >>
> > >> +++
> b/hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java
> > >>
> > >> @@ -237,6 +237,8 @@ public class HoodieHiveClient {
> > >>
> > >>          if (!paths.containsKey(storageValue)) {
> > >>
> > >>
> > >> events.add(PartitionEvent.newPartitionAddEvent(storagePartition));
> > >>
> > >>          } else if
> > >> (!paths.get(storageValue).equals(fullStoragePartitionPath)) {
> > >>
> > >> +          LOG.info("Partition Location changes. StorageVal=" +
> > >> storageValue
> > >>
> > >> +              + ", Existing Hive Path=" + paths.get(storageValue) +
> ",
> > >> New Location=" + fullStoragePartitionPath);
> > >>
> > >>
> > >> events.add(PartitionEvent.newPartitionUpdateEvent(storagePartition));
> > >>
> > >>          }
> > >>
> > >>        }
> > >>
> > > >> Thanks, Balaji.V
> > >>    On Friday, January 17, 2020, 03:44:08 AM PST, Purushotham
> > >> Pushpavanthar <pu...@gmail.com> wrote:
> > >>
> > >>  Hi,
> > >>
> > >> I noticed that
> > >> *org.apache.hudi.hive.HoodieHiveClient#updatePartitionsToTable()* is
> > time
> > >> consuming while running HUDI on set of records which contains data for
> > >> large set of partitions. All it is doing is setting location for each
> > >> updated partition path. However,
> > >> *org.apache.hudi.hive.HoodieHiveClient#addPartitionsToTable()
> > >> *is taking care of adding new partitions to the table.
> > >>
> > >>  1. For a given table, whose base path doesn't change (usually it
> > doesn't
> > >>  in production), why *updatePartitionsToTable() *is needed? Can you
> > >>  please throw some light on any such case where this is needed?
> > >>  2. If it is required, can we do something to optimise the time
> > consumed
> > >>  by this operation? Currently, the *Alter Statements* are executed one
> > by
> > >>  one on each (partition, path) pair for every updated partition.
> > >>
> > >>
> > >>
> > >> Regards,
> > >> Purushotham Pushpavanth
> > >>
> > >
> > >
> >
>

Re: updatePartitionsToTable() is time consuming and redundant.

Posted by Balaji Varadarajan <v....@ymail.com.INVALID>.
 
Sorry for the delay. From the logs, it is clear that the stored partition key and the lookup key are not exactly the same: one has a scheme and authority in its URI while the other does not. This is why we are updating the same partitions again.
Some of the methods used here come from hadoop-common and related packages. With Hadoop 2.7.3, I am NOT able to reproduce this issue locally. I used the code below to try to reproduce it. Which version of Hadoop are you using at runtime? Can you check whether the stripped path (see the test code below) still contains the scheme and authority?

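```
import org.apache.hadoop.fs.Path;

public void testit() {
    // Fully qualified partition location, as shown in the "New Location" field of the logs.
    Path path = new Path("s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191117");
    System.out.println("Path is : " + path.toUri().getPath());
    System.out.println("Is Absolute :" + path.isUriPathAbsolute());
    // Drop the scheme (s3a) and the authority (bucket name) so that only the path component remains.
    String stripped = Path.getPathWithoutSchemeAndAuthority(path).toUri().getPath();
    System.out.println("Stripped Path =" + stripped);
}
```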
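
To make the mismatch concrete without a Hadoop dependency, here is a minimal sketch of the same idea using only `java.net.URI`. The class name `PathCompareSketch` and the helper `stripSchemeAndAuthority` are made up for illustration and are not part of Hudi or Hadoop; the two strings are taken from the "Existing Hive Path" and "New Location" fields in the logs.

```java
import java.net.URI;

// Hypothetical illustration: compares a metastore path with a fully qualified storage location.
class PathCompareSketch {
    // Returns only the path component of a location, dropping scheme and authority if present.
    static String stripSchemeAndAuthority(String location) {
        return URI.create(location).getPath();
    }

    public static void main(String[] args) {
        String hivePath = "/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191117";
        String storagePath = "s3a://dataplatform.internal.warehouse" + hivePath;

        // Raw string comparison: differs because of the s3a scheme and bucket name.
        System.out.println(hivePath.equals(storagePath)); // false

        // Comparison after stripping both sides: the partition location is unchanged.
        System.out.println(stripSchemeAndAuthority(hivePath)
                .equals(stripSchemeAndAuthority(storagePath))); // true
    }
}
```

If the raw strings are what gets compared during sync, every touched partition would look like it moved, which would explain one alter-partition call per updated partition.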
Balaji.V


    On Wednesday, February 5, 2020, 12:53:57 AM PST, Purushotham Pushpavanthar <pu...@gmail.com> wrote:  
 
 Hi Balaji/Vinoth,

Below is the log we obtained from Hudi.

20/01/22 10:30:03 INFO HiveSyncTool: Last commit time synced was found to
be 20200122094611
20/01/22 10:30:03 INFO HoodieHiveClient: Last commit time synced is
20200122094611, Getting commits since then
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20180108, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180108,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180108
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20180221, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180221,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180221
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20180102, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180102,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180102
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20191007, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191007,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191007
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20191128, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191128,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191128
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20191127, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191127,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191127
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20191006, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191006,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191006
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20191009, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191009,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191009
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20191129, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191129,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191129
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20191008, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191008,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191008
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20191120, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191120,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191120
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20191122, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191122,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191122
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20191001, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191001,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191001
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20191121, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191121,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191121
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20191124, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191124,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191124
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20191003, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191003,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191003
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20191002, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191002,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191002
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20191123, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191123,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191123
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20191005, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191005,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191005
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20191126, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191126,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191126
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20191125, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191125,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191125
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20191004, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191004,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191004
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20181208, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181208,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181208
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20181207, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181207,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181207
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20181206, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181206,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181206
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20181205, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181205,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181205
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20180117, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180117,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180117
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20181209, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181209,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181209
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20181204, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181204,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181204
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20181203, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181203,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181203
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20181202, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181202,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181202
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20181201, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181201,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181201
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20191117, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191117,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191117

Regards,
Purushotham Pushpavanth



On Tue, 4 Feb 2020 at 05:50, Vinoth Chandar <vi...@apache.org> wrote:

> Unfortunately, the mailing list does not support attachments, looks like :(
> Could you paste it inline?
>
  

Re: updatePartitionsToTable() is time consuming and redundant.

Posted by Vinoth Chandar <vi...@apache.org>.
My apologies, this slipped through the cracks.

I will take a look at it later today and circle back.

On Wed, Feb 5, 2020 at 12:53 AM Purushotham Pushpavanthar <
pushpavanthar@gmail.com> wrote:

> Hi Balaji/Vinoth,
>
> Below is the log we obtained from Hudi.
>
> 20/01/22 10:30:03 INFO HiveSyncTool: Last commit time synced was found to
> be 20200122094611
> 20/01/22 10:30:03 INFO HoodieHiveClient: Last commit time synced is
> 20200122094611, Getting commits since then
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20180108, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180108,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180108
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20180221, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180221,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180221
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20180102, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180102,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180102
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191007, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191007,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191007
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191128, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191128,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191128
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191127, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191127,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191127
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191006, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191006,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191006
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191009, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191009,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191009
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191129, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191129,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191129
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191008, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191008,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191008
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191120, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191120,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191120
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191122, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191122,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191122
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191001, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191001,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191001
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191121, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191121,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191121
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191124, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191124,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191124
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191003, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191003,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191003
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191002, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191002,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191002
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191123, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191123,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191123
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191005, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191005,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191005
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191126, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191126,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191126
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191125, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191125,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191125
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191004, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191004,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191004
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20181208, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181208,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181208
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20181207, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181207,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181207
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20181206, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181206,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181206
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20181205, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181205,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181205
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20180117, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180117,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180117
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20181209, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181209,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181209
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20181204, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181204,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181204
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20181203, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181203,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181203
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20181202, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181202,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181202
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20181201, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181201,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181201
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191117, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191117,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191117
>
> Regards,
> Purushotham Pushpavanth
>
>
>

Re: updatePartitionsToTable() is time consuming and redundant.

Posted by Purushotham Pushpavanthar <pu...@gmail.com>.
Hi Balaji/Vinoth,

Below is the log we obtained from Hudi.

20/01/22 10:30:03 INFO HiveSyncTool: Last commit time synced was found to
be 20200122094611
20/01/22 10:30:03 INFO HoodieHiveClient: Last commit time synced is
20200122094611, Getting commits since then
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20180108, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180108,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180108
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20180221, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180221,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180221
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20180102, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180102,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180102
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20191007, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191007,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191007
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20191128, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191128,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191128
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20191127, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191127,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191127
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20191006, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191006,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191006
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20191009, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191009,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191009
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20191129, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191129,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191129
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20191008, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191008,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191008
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20191120, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191120,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191120
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20191122, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191122,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191122
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20191001, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191001,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191001
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20191121, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191121,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191121
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20191124, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191124,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191124
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20191003, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191003,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191003
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20191002, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191002,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191002
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20191123, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191123,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191123
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20191005, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191005,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191005
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20191126, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191126,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191126
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20191125, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191125,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191125
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20191004, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191004,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191004
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20181208, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181208,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181208
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20181207, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181207,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181207
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20181206, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181206,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181206
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20181205, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181205,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181205
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20180117, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180117,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180117
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20181209, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181209,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181209
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20181204, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181204,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181204
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20181203, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181203,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181203
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20181202, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181202,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181202
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20181201, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181201,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181201
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20191117, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191117,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191117
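
In every entry above, the existing Hive path and the new location differ only
in the s3a:// scheme and bucket prefix. Below is a minimal sketch of a
comparison that ignores scheme and authority. It uses only java.net.URI; the
class and method names are hypothetical, and this is not Hudi's actual
implementation.

```java
import java.net.URI;

final class PartitionLocationCompare {

    // Returns only the path component, dropping the scheme (e.g. "s3a")
    // and the authority (bucket or host) if they are present.
    static String pathOnly(String location) {
        return URI.create(location).getPath();
    }

    // Two locations are treated as the same partition location when their
    // path components match, regardless of scheme and authority.
    static boolean sameLocation(String existingHivePath, String newLocation) {
        return pathOnly(existingHivePath).equals(pathOnly(newLocation));
    }

    public static void main(String[] args) {
        String existing =
            "/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180108";
        String updated =
            "s3a://dataplatform.internal.warehouse"
                + "/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180108";

        // Plain string equality reports a change (prints false).
        System.out.println(existing.equals(updated));
        // Comparing only the path component reports no change (prints true).
        System.out.println(sameLocation(existing, updated));
    }
}
```

Whether dropping the scheme and bucket is safe depends on the deployment: a
table that is really moved between buckets would no longer be detected by
this check.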

Regards,
Purushotham Pushpavanth



On Tue, 4 Feb 2020 at 05:50, Vinoth Chandar <vi...@apache.org> wrote:

> Unfortunately, the mailing list does not support attachments, looks like :(
> Could you paste it inline?
>

Re: updatePartitionsToTable() is time consuming and redundant.

Posted by Vinoth Chandar <vi...@apache.org>.
Unfortunately, the mailing list does not support attachments, looks like :(
Could you paste it inline?

On Sat, Feb 1, 2020 at 6:20 AM Purushotham Pushpavanthar <
pushpavanthar@gmail.com> wrote:

> Hi Balaji,
>
> The attachment contains the logs you asked for.
> However, the only difference between storageValue and
> fullStoragePartitionPath is *target-base-path*.
> So if I'm not wrong, the code will be marking all partitions which got
> UPDATE data for partition update. Hence time consuming.
>
> Regards,
> Purushotham Pushpavanth
>
>
>

Re: updatePartitionsToTable() is time consuming and redundant.

Posted by Purushotham Pushpavanthar <pu...@gmail.com>.
Hi Balaji,

The attachment contains the logs you asked for.
However, the only difference between storageValue and
fullStoragePartitionPath is the *target-base-path*.
So, if I'm not wrong, the code marks every partition that received
UPDATE data for a partition update, which is why it is time consuming.
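
To make the concern concrete, here is a minimal sketch of the add/update
decision from the diff quoted below. It is not Hudi's code: the class name,
the classify method, and the string events are hypothetical, and it assumes
the metastore returns locations without the scheme and bucket while the base
path includes them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class PartitionEventSketch {

    // Mirrors the if / else-if in the diff: a partition unknown to Hive
    // becomes an ADD, a known partition whose stored location string differs
    // from the computed full path becomes an UPDATE.
    static List<String> classify(Map<String, String> hivePaths,
                                 List<String> writtenPartitions,
                                 String basePath) {
        List<String> events = new ArrayList<>();
        for (String partition : writtenPartitions) {
            // "dt=20180108" -> "20180108"
            String value = partition.substring(partition.indexOf('=') + 1);
            String fullPath = basePath + "/" + partition;
            if (!hivePaths.containsKey(value)) {
                events.add("ADD " + partition);
            } else if (!hivePaths.get(value).equals(fullPath)) {
                events.add("UPDATE " + partition);
            }
        }
        return events;
    }

    public static void main(String[] args) {
        String table = "/data/warehouse/hudi/external/prod_db/sales_order_item";
        Map<String, String> hivePaths = new LinkedHashMap<>();
        hivePaths.put("20180108", table + "/dt=20180108");
        hivePaths.put("20180221", table + "/dt=20180221");

        // Assumed base path: the same table directory, with scheme and bucket.
        String basePath = "s3a://dataplatform.internal.warehouse" + table;

        // Prints [UPDATE dt=20180108, UPDATE dt=20180221, ADD dt=20200122]
        System.out.println(classify(hivePaths,
            Arrays.asList("dt=20180108", "dt=20180221", "dt=20200122"), basePath));
    }
}
```

With a strict string comparison, every existing partition touched by a commit
is flagged for update on each sync, even though nothing has moved.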

Regards,
Purushotham Pushpavanth



On Mon, 20 Jan 2020 at 08:58, Balaji Varadarajan <v....@ymail.com.invalid>
wrote:

>  Hi Purushotham,
> I am unable to reproduce the same partitions getting hive-synced locally.
> Can you add the following log message to HoodieHiveClient.java, run the
> code, and send us the logs?
> diff --git a/hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java b/hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java
> index 4578bb2f..ba4b1147 100644
> --- a/hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java
> +++ b/hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java
> @@ -237,6 +237,8 @@ public class HoodieHiveClient {
>          if (!paths.containsKey(storageValue)) {
>            events.add(PartitionEvent.newPartitionAddEvent(storagePartition));
>          } else if (!paths.get(storageValue).equals(fullStoragePartitionPath)) {
> +          LOG.info("Partition Location changes. StorageVal=" + storageValue
> +              + ", Existing Hive Path=" + paths.get(storageValue) + ", New Location=" + fullStoragePartitionPath);
>            events.add(PartitionEvent.newPartitionUpdateEvent(storagePartition));
>          }
>        }
>
> Thanks,
> Balaji.V
>     On Friday, January 17, 2020, 03:44:08 AM PST, Purushotham
> Pushpavanthar <pu...@gmail.com> wrote:
>
>  Hi,
>
> I noticed that
> *org.apache.hudi.hive.HoodieHiveClient#updatePartitionsToTable()* is time
> consuming while running HUDI on set of records which contains data for
> large set of partitions. All it is doing is setting location for each
> updated partition path. However,
> *org.apache.hudi.hive.HoodieHiveClient#addPartitionsToTable()
> *is taking care of adding new partitions to the table.
>
>   1. For a given table, whose base path doesn't change (usually it doesn't
>   in production), why *updatePartitionsToTable() *is needed? Can you
>   please throw some light on any such case where this is needed?
>   2. If it is required, can we do something to optimise the time consumed
>   by this operation? Currently, the *Alter Statements* are executed one by
>   one on each (partition, path) pair for every updated partition.
>
>
>
> Regards,
> Purushotham Pushpavanth
>

Re: updatePartitionsToTable() is time consuming and redundant.

Posted by Balaji Varadarajan <v....@ymail.com.INVALID>.
Hi Purushotham,
I am unable to reproduce the same partitions getting hive-synced locally. Can you add the following log message to HoodieHiveClient.java, run the code, and send us the logs?
diff --git a/hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java b/hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java
index 4578bb2f..ba4b1147 100644
--- a/hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java
+++ b/hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java
@@ -237,6 +237,8 @@ public class HoodieHiveClient {
         if (!paths.containsKey(storageValue)) {
           events.add(PartitionEvent.newPartitionAddEvent(storagePartition));
         } else if (!paths.get(storageValue).equals(fullStoragePartitionPath)) {
+          LOG.info("Partition Location changes. StorageVal=" + storageValue
+              + ", Existing Hive Path=" + paths.get(storageValue) + ", New Location=" + fullStoragePartitionPath);
           events.add(PartitionEvent.newPartitionUpdateEvent(storagePartition));
         }
       }
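
To make the branch in the diff easier to reason about outside Hudi, here is a minimal standalone sketch of the same three-way decision: a partition missing from Hive becomes an ADD, a partition whose stored location differs from the computed one becomes an UPDATE, and everything else is left alone. The variable names storageValue and fullStoragePartitionPath follow the diff; the class PartitionEventSketch, the Type enum, and the map-based inputs are assumptions made for illustration and are not the HoodieHiveClient API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the partition-event logic shown in the diff.
final class PartitionEventSketch {

    enum Type { ADD, UPDATE }

    /**
     * hivePaths:    partition value -> location currently registered in Hive
     * storagePaths: partition value -> location computed from the table base path
     * Returns the events that a sync would issue, keyed by partition value.
     */
    static Map<String, Type> classify(Map<String, String> hivePaths,
                                      Map<String, String> storagePaths) {
        Map<String, Type> events = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : storagePaths.entrySet()) {
            String storageValue = entry.getKey();
            String fullStoragePartitionPath = entry.getValue();
            if (!hivePaths.containsKey(storageValue)) {
                // Partition unknown to Hive: it has to be added.
                events.put(storageValue, Type.ADD);
            } else if (!hivePaths.get(storageValue).equals(fullStoragePartitionPath)) {
                // Exact string mismatch: treated as a location change.
                events.put(storageValue, Type.UPDATE);
            }
            // Identical location: no event, so no ALTER statement is needed.
        }
        return events;
    }
}
```

Because the comparison is an exact string match, any systematic difference in how the two sides write the base path would put every touched partition into the UPDATE bucket, which is the situation the requested log line is meant to reveal.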

Thanks,
Balaji.V
    On Friday, January 17, 2020, 03:44:08 AM PST, Purushotham Pushpavanthar <pu...@gmail.com> wrote:  
 
 Hi,

I noticed that
*org.apache.hudi.hive.HoodieHiveClient#updatePartitionsToTable()* is time
consuming when running HUDI on a set of records that contains data for a
large number of partitions. All it does is set the location for each
updated partition path, while
*org.apache.hudi.hive.HoodieHiveClient#addPartitionsToTable()*
already takes care of adding new partitions to the table.

  1. For a given table whose base path doesn't change (usually it doesn't
  in production), why is *updatePartitionsToTable()* needed? Can you
  please shed some light on any case where it is needed?
  2. If it is required, can we do something to optimise the time consumed
  by this operation? Currently, the *Alter Statements* are executed one by
  one on each (partition, path) pair for every updated partition.



Regards,
Purushotham Pushpavanth