Posted to dev@gobblin.apache.org by "Sridivakar (Jira)" <ji...@apache.org> on 2021/08/10 13:57:00 UTC

[jira] [Updated] (GOBBLIN-1395) Spurious PostPublishStep WorkUnits with CREATE TABLE/ADD PARTITION for HIVE table copy

     [ https://issues.apache.org/jira/browse/GOBBLIN-1395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sridivakar updated GOBBLIN-1395:
--------------------------------
    Description: 
For Hive Table incremental copy, Gobblin creates spurious {{PostPublishSteps}} for Hive Registrations :
 # Creates many {{PostPublishStep}}s with CREATE TABLE: for a source table with P partitions in total, a total of P such steps are created, irrespective of how many partitions actually need to be moved during the increment.
 # Also creates {{PostPublishStep}}s with ADD PARTITION for partitions that are already present at the target, even though no {{CopyableFile}} work units are created for those partitions.

These spurious steps hurt performance: the duration of an incremental data movement increases day by day as the number of WorkUnits grows.
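In effect, each such step issues DDL against the target metastore that is roughly equivalent to the following (database, table and partition names are taken from the reproduction below; the exact statements and the column list are assumptions, since the Hive registration builds them internally):
{code:sql}
-- The column list below is hypothetical; the registration uses the real source schema.
-- Scheduled once per source partition, although the table only needs to be created once:
CREATE TABLE IF NOT EXISTS tc_db_copy_1.tc_p5_r10 (id INT)
PARTITIONED BY (dt STRING);

-- Also scheduled for partitions that already exist at the target and for which
-- no CopyableFile work unit is created:
ALTER TABLE tc_db_copy_1.tc_p5_r10
ADD IF NOT EXISTS PARTITION (dt='2020-12-26');
{code}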

*Steps to reproduce:*
h5. Step 1)

a) create a table with 5 partitions, with some rows in each partition
{code:sql}
hive> show partitions tc_p5_r10;
 OK
 dt=2020-12-26
 dt=2020-12-27
 dt=2020-12-28
 dt=2020-12-29
 dt=2020-12-30
 Time taken: 1.287 seconds, Fetched: 5 row(s)
{code}
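A minimal Hive DDL sketch that would produce such a table (only the database/table name and the partition values come from this report; the {{id}} column and the row values are assumptions):
{code:sql}
-- Hypothetical single-column schema; the real schema is not shown in this report.
CREATE TABLE tc_db.tc_p5_r10 (id INT)
PARTITIONED BY (dt STRING);

-- Populate the 5 partitions with a couple of rows each.
INSERT INTO TABLE tc_db.tc_p5_r10 PARTITION (dt='2020-12-26') VALUES (1), (2);
INSERT INTO TABLE tc_db.tc_p5_r10 PARTITION (dt='2020-12-27') VALUES (1), (2);
INSERT INTO TABLE tc_db.tc_p5_r10 PARTITION (dt='2020-12-28') VALUES (1), (2);
INSERT INTO TABLE tc_db.tc_p5_r10 PARTITION (dt='2020-12-29') VALUES (1), (2);
INSERT INTO TABLE tc_db.tc_p5_r10 PARTITION (dt='2020-12-30') VALUES (1), (2);
{code}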
b) Do DataMovement with the Job configuration given below
 c) Observations :
{quote}{{Total No. of old partitions in the table (O) : 0}}
 {{Total No. of new partitions in the table (N): 5}}
 {{*Total WorkUnits created (W): 15 ( 2 x (O+N) + N )*}}
 {{CopyableFile WorkUnits: 5 (one for each partition)}}
 {{PostPublishStep WorkUnits: 10 ({color:#de350b}two for each partition in the table, out of the two: one for publishing table metadata; another for publishing partition metadata{color})}}
{quote}
h5. Step 2)

a) add 5 more partitions, with some rows in each partition 
{code:sql}
hive> show partitions tc_p5_r10;
 OK
 dt=2020-12-26
 dt=2020-12-27
 dt=2020-12-28
 dt=2020-12-29
 dt=2020-12-30
 dt=2021-01-01
 dt=2021-01-02
 dt=2021-01-03
 dt=2021-01-04
 dt=2021-01-05
 Time taken: 0.131 seconds, Fetched: 10 row(s)
{code}
 _Note: the partition for 31st Dec is intentionally left out; it will be added in step (3)._
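The new partitions can be added the same way as in step 1 (again, the column and row values are assumptions):
{code:sql}
-- Add the 5 January partitions; 2020-12-31 is deliberately skipped for step 3.
INSERT INTO TABLE tc_db.tc_p5_r10 PARTITION (dt='2021-01-01') VALUES (1), (2);
INSERT INTO TABLE tc_db.tc_p5_r10 PARTITION (dt='2021-01-02') VALUES (1), (2);
INSERT INTO TABLE tc_db.tc_p5_r10 PARTITION (dt='2021-01-03') VALUES (1), (2);
INSERT INTO TABLE tc_db.tc_p5_r10 PARTITION (dt='2021-01-04') VALUES (1), (2);
INSERT INTO TABLE tc_db.tc_p5_r10 PARTITION (dt='2021-01-05') VALUES (1), (2);
{code}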

b) Do DataMovement with the Job configuration given below
 c) Observations :
{quote}{{Total No. of old partitions in the table (O): 5}}
 {{Total No. of new partitions in the table (N): 5}}
 {{Total WorkUnits created (W) : 25 ( 2 x (O+N) + N )}}
 {{CopyableFile WorkUnits: 5 (one for each newly found partition)}}
 {{PostPublishStep WorkUnits: 20 ({color:#de350b}two for every partition in the entire table, not just for new partitions!{color})}}
{quote}
h5. Step 3)

a) At the source, add the missing partition (2020-12-31) in the middle, with some rows in the partition
{code:sql}
hive> show partitions tc_p5_r10;
 OK
 dt=2020-12-26
 dt=2020-12-27
 dt=2020-12-28
 dt=2020-12-29
 dt=2020-12-30
 dt=2020-12-31
 dt=2021-01-01
 dt=2021-01-02
 dt=2021-01-03
 dt=2021-01-04
 dt=2021-01-05
 Time taken: 0.101 seconds, Fetched: 11 row(s)
{code}
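The back-fill that produces the listing above can be done with a single insert (same assumptions as in step 1):
{code:sql}
-- Back-fill the partition that was deliberately skipped in step 2.
INSERT INTO TABLE tc_db.tc_p5_r10 PARTITION (dt='2020-12-31') VALUES (1), (2);
{code}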
b) Do DataMovement with the Job configuration given below
 c) Observations :
{quote}{{Total No. of old partitions in the table (O): 10}}
 {{Total No. of new partitions in the table (N): 1}}
 {{*Total WorkUnits created (W): 23 ( 2 x (O+N) + N )*}}
 {{CopyableFile WorkUnits: 1 (one for each newly found partition)}}
 {{PostPublishStep WorkUnits: 22 ({color:#de350b}two for every partition in the entire table, not just for new partition!{color})}}
{quote}
 

 
h4. +Job Configuration used:+
{code}
job.name=LocalHive2LocalHive-tc_db-tc_p5_r10-*
job.description=Test Gobblin job for copy
# target location for copy
data.publisher.final.dir=/tmp/hive/tc_db_1_copy/tc_p5_r10/data
gobblin.dataset.profile.class=org.apache.gobblin.data.management.copy.hive.HiveDatasetFinder
source.filebased.fs.uri="hdfs://localhost:8020"
hive.dataset.hive.metastore.uri=thrift://localhost:9083
hive.dataset.copy.target.table.root=${data.publisher.final.dir}
hive.dataset.copy.target.metastore.uri=thrift://localhost:9083
hive.dataset.copy.target.database=tc_db_copy_1
hive.db.root.dir=${data.publisher.final.dir}
# writer.fs.uri="hdfs://127.0.0.1:8020/"
hive.dataset.whitelist=tc_db.tc_p5_r10
gobblin.copy.recursive.update=true
# ====================================================================
# Distcp configurations (do not change)
# ====================================================================
type=hadoopJava
job.class=org.apache.gobblin.azkaban.AzkabanJobLauncher
extract.namespace=org.apache.gobblin.copy
data.publisher.type=org.apache.gobblin.data.management.copy.publisher.CopyDataPublisher
source.class=org.apache.gobblin.data.management.copy.CopySource
writer.builder.class=org.apache.gobblin.data.management.copy.writer.FileAwareInputStreamDataWriterBuilder
converter.classes=org.apache.gobblin.converter.IdentityConverter
task.maxretries=0
workunit.retry.enabled=false
distcp.persist.dir=/tmp/distcp-persist-dir
{code}
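After a run, the registrations that actually reached the target metastore can be checked against the target database configured above:
{code:sql}
-- Target database taken from hive.dataset.copy.target.database in the configuration above.
USE tc_db_copy_1;
SHOW PARTITIONS tc_p5_r10;
{code}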



> Spurious PostPublishStep WorkUnits with CREATE TABLE/ADD PARTITION for HIVE table copy
> --------------------------------------------------------------------------------------
>
>                 Key: GOBBLIN-1395
>                 URL: https://issues.apache.org/jira/browse/GOBBLIN-1395
>             Project: Apache Gobblin
>          Issue Type: Bug
>          Components: hive-registration
>    Affects Versions: 0.15.0
>            Reporter: Sridivakar
>            Assignee: Abhishek Tiwari
>            Priority: Minor
>              Labels: Hive, PayPal, Performance
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)