Posted to dev@gobblin.apache.org by "Sridivakar (Jira)" <ji...@apache.org> on 2021/02/28 03:19:00 UTC

[jira] [Updated] (GOBBLIN-1395) Spurious PostPublishStep WorkUnits with CREATE TABLE/ADD PARTITION for HIVE table copy

     [ https://issues.apache.org/jira/browse/GOBBLIN-1395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sridivakar updated GOBBLIN-1395:
--------------------------------
    Summary: Spurious PostPublishStep WorkUnits with CREATE TABLE/ADD PARTITION for HIVE table copy  (was: Spurious PostPublishStep WorkUnits of CREATE TABLE/ADD PARTITIONs for HIVE table copy)

> Spurious PostPublishStep WorkUnits with CREATE TABLE/ADD PARTITION for HIVE table copy
> --------------------------------------------------------------------------------------
>
>                 Key: GOBBLIN-1395
>                 URL: https://issues.apache.org/jira/browse/GOBBLIN-1395
>             Project: Apache Gobblin
>          Issue Type: Bug
>          Components: hive-registration
>    Affects Versions: 0.15.0
>            Reporter: Sridivakar
>            Assignee: Abhishek Tiwari
>            Priority: Minor
>              Labels: Hive, PayPal, Performance
>
> For Hive copy, Gobblin creates spurious {{PostPublishStep}} work units for Hive registration:
>  # It creates a {{PostPublishStep}} with CREATE TABLE for every partition: a source table with P partitions yields a total of P such steps.
>  # It creates {{PostPublishStep}} work units with ADD PARTITION for partitions that are already present at the target, even though no {{CopyableFile}} work units are created for those partitions; these steps are not required.
> These spurious calls hurt performance. For instance, the duration of incremental data movement increases day by day as the number of WorkUnits grows: the older the table, the longer each run takes.
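> The counts reported in all three steps below fit one formula: with O old and N new partitions, W = 2 × (O + N) + N work units are created. A minimal sketch of that arithmetic (illustrative Python, not Gobblin code):

```python
def observed_workunits(old_parts: int, new_parts: int) -> int:
    """Work units produced per run, as reported in this issue:
    one CopyableFile per newly found partition, plus two
    PostPublishSteps (one publishing table metadata, one publishing
    partition metadata) for EVERY partition in the table, old or new."""
    copyable_file = new_parts
    post_publish = 2 * (old_parts + new_parts)
    return copyable_file + post_publish  # W = 2*(O+N) + N

print(observed_workunits(0, 5))   # Step 1: 15
print(observed_workunits(5, 5))   # Step 2: 25
print(observed_workunits(10, 1))  # Step 3: 23
```

> Note that the 2 × (O + N) term grows with the total partition count, which explains why runs slow down as the table ages.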
> *Steps to reproduce:*
> h5. Step 1)
> a) Create a table with 5 partitions, with some rows in each partition
> {code:sql}
> hive> show partitions tc_p5_r10;
>  OK
>  dt=2020-12-26
>  dt=2020-12-27
>  dt=2020-12-28
>  dt=2020-12-29
>  dt=2020-12-30
>  Time taken: 1.287 seconds, Fetched: 5 row(s)
> {code}
> b) Run the data movement with the Job configuration given below
>  c) Observations:
> {quote}{{Total No. of old partitions in the table (O) : 0}}
>  {{Total No. of new partitions in the table (N): 5}}
>  {{*Total WorkUnits created (W): 15 ( 2 x (O+N) + N )*}}
>  {{CopyableFile WorkUnits: 5 (one for each partition)}}
>  {{PostPublishStep WorkUnits: 10 ({color:#de350b}two for each partition in the table, out of the two: one for publishing table metadata; another for publishing partition metadata{color})}}
> {quote}
> h5. Step 2)
> a) Add 5 more partitions, with some rows in each partition
> {code:sql}
> hive> show partitions tc_p5_r10;
>  OK
>  dt=2020-12-26
>  dt=2020-12-27
>  dt=2020-12-28
>  dt=2020-12-29
>  dt=2020-12-30
>  dt=2021-01-01
>  dt=2021-01-02
>  dt=2021-01-03
>  dt=2021-01-04
>  dt=2021-01-05
>  Time taken: 0.131 seconds, Fetched: 10 row(s)
> {code}
>  _Note: the partition for 31st Dec is missing, intentionally left out for step (3)_
> b) Run the data movement with the Job configuration given below
>  c) Observations:
> {quote}{{Total No. of old partitions in the table (O): 5}}
>  {{Total No. of new partitions in the table (N): 5}}
>  {{Total WorkUnits created (W) : 25 ( 2 x (O+N) + N )}}
>  {{CopyableFile WorkUnits: 5 (one for each newly found partition)}}
>  {{PostPublishStep WorkUnits: 20 ({color:#de350b}two for every partition in the entire table, not just for new partitions!{color})}}
> {quote}
> h5. Step 3)
> a) At the source, add the missing partition (2020-12-31) in the middle, with some rows in it
> {code:sql}
> hive> show partitions tc_p5_r10;
>  OK
>  dt=2020-12-26
>  dt=2020-12-27
>  dt=2020-12-28
>  dt=2020-12-29
>  dt=2020-12-30
>  dt=2020-12-31
>  dt=2021-01-01
>  dt=2021-01-02
>  dt=2021-01-03
>  dt=2021-01-04
>  dt=2021-01-05
>  Time taken: 0.101 seconds, Fetched: 11 row(s)
> {code}
> b) Run the data movement with the Job configuration given below
>  c) Observations:
> {quote}{{Total No. of old partitions in the table (O): 10}}
>  {{Total No. of new partitions in the table (N): 1}}
>  {{*Total WorkUnits created (W): 23 ( 2 x (O+N) + N )*}}
>  {{CopyableFile WorkUnits: 1 (one for each newly found partition)}}
>  {{PostPublishStep WorkUnits: 22 ({color:#de350b}two for every partition in the entire table, not just for new partition!{color})}}
> {quote}
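> If the metadata-publishing steps were generated only for newly copied partitions (the behavior this report argues for; hypothetical here, not current Gobblin code), the work-unit count would track the incremental delta instead of growing with the age of the table:

```python
def observed(old_parts: int, new_parts: int) -> int:
    # Behaviour reported above: two metadata steps per partition
    # in the ENTIRE table, plus one CopyableFile per new partition.
    return 2 * (old_parts + new_parts) + new_parts

def expected(old_parts: int, new_parts: int) -> int:
    # Hypothetical fix: metadata steps only for new partitions.
    return 2 * new_parts + new_parts

# Step 3 (O=10, N=1): 23 work units observed, vs. 3 if only the
# newly added partition were registered.
print(observed(10, 1), expected(10, 1))
```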
>  
>  
> h4. +Job Configuration used:+
> {code:java}
> job.name=LocalHive2LocalHive-tc_db-tc_p5_r10-*
> job.description=Test Gobblin job for copy
> # target location for copy
> data.publisher.final.dir=/tmp/hive/tc_db_1_copy/tc_p5_r10/data
> gobblin.dataset.profile.class=org.apache.gobblin.data.management.copy.hive.HiveDatasetFinder
> source.filebased.fs.uri="hdfs://localhost:8020"
> hive.dataset.hive.metastore.uri=thrift://localhost:9083
> hive.dataset.copy.target.table.root=${data.publisher.final.dir}
> hive.dataset.copy.target.metastore.uri=thrift://localhost:9083
> hive.dataset.copy.target.database=tc_db_copy_1
> hive.db.root.dir=${data.publisher.final.dir}
> # writer.fs.uri="hdfs://127.0.0.1:8020/"
> hive.dataset.whitelist=tc_db.tc_p5_r10
> gobblin.copy.recursive.update=true
> # ====================================================================
> # Distcp configurations (do not change)
> # ====================================================================
> type=hadoopJava
> job.class=org.apache.gobblin.azkaban.AzkabanJobLauncher
> extract.namespace=org.apache.gobblin.copy
> data.publisher.type=org.apache.gobblin.data.management.copy.publisher.CopyDataPublisher
> source.class=org.apache.gobblin.data.management.copy.CopySource
> writer.builder.class=org.apache.gobblin.data.management.copy.writer.FileAwareInputStreamDataWriterBuilder
> converter.classes=org.apache.gobblin.converter.IdentityConverter
> task.maxretries=0
> workunit.retry.enabled=false
> distcp.persist.dir=/tmp/distcp-persist-dir
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)