Posted to dev@gobblin.apache.org by "Sridivakar (Jira)" <ji...@apache.org> on 2021/02/25 18:03:00 UTC

[jira] [Created] (GOBBLIN-1395) Spurious PostPublishStep WorkUnits for CREATE TABLE/ADD PARTITIONs for HIVE table copy

Sridivakar created GOBBLIN-1395:
-----------------------------------

             Summary: Spurious PostPublishStep WorkUnits for CREATE TABLE/ADD PARTITIONs for HIVE table copy
                 Key: GOBBLIN-1395
                 URL: https://issues.apache.org/jira/browse/GOBBLIN-1395
             Project: Apache Gobblin
          Issue Type: Bug
          Components: hive-registration
    Affects Versions: 0.15.0
            Reporter: Sridivakar
            Assignee: Abhishek Tiwari


For Hive copy, Gobblin creates spurious {{PostPublishSteps}}:
 # It creates many {{PostPublishSteps}} with CREATE TABLE: for a source table with P partitions, it creates a total of P such {{PostPublishSteps}}.
 # It creates {{PostPublishSteps}} with ADD PARTITIONS even for partitions already present at the target, even though it does not create {{CopyableFile}} work units for those partitions.

 

*Steps to reproduce:*
h5. Step 1)

a) create a table with 5 partitions, with some rows in each partition
{code:sql}
hive> show partitions tc_p5_r10;
 OK
 dt=2020-12-26
 dt=2020-12-27
 dt=2020-12-28
 dt=2020-12-29
 dt=2020-12-30
 Time taken: 1.287 seconds, Fetched: 5 row(s)
{code}
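For reference, the table can be set up with DDL along these lines (the column names here are hypothetical; only the table name and the {{dt}} partition key come from the listing above):
{code:sql}
-- Hypothetical DDL matching the partition listing above; column names are illustrative.
CREATE TABLE tc_p5_r10 (id INT, val STRING)
PARTITIONED BY (dt STRING);

-- One example row per partition
INSERT INTO tc_p5_r10 PARTITION (dt='2020-12-26') VALUES (1, 'a');
INSERT INTO tc_p5_r10 PARTITION (dt='2020-12-27') VALUES (2, 'b');
-- ... and so on for the remaining partitions
{code}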
b) Do DataMovement with the Job configuration mentioned below
c) Observations :
{quote}{{Total No. of old partitions in the table (O): 0}}
{{Total No. of new partitions in the table (N): 5}}
{{*Total WorkUnits created (W): 15 ( 2 x (O+N) + N )*}}
{{CopyableFile WorkUnits: 5 (one for each partition)}}
{{PostPublishStep WorkUnits: 10 ({color:#de350b}two for each partition in the table, out of the two: one for publishing table metadata; another for publishing partition metadata{color})}}
{quote}
h5. Step 2)

a) add 5 more partitions, with some rows in each partition 
{code:sql}
hive> show partitions tc_p5_r10;
 OK
 dt=2020-12-26
 dt=2020-12-27
 dt=2020-12-28
 dt=2020-12-29
 dt=2020-12-30
 dt=2021-01-01
 dt=2021-01-02
 dt=2021-01-03
 dt=2021-01-04
 dt=2021-01-05
 Time taken: 0.131 seconds, Fetched: 10 row(s)
{code}
 _Note: there is a missing partition for 31st Dec, intentionally left out for Step 3_

b) Do DataMovement with the Job configuration below
c) Observations :
{quote}{{Total No. of old partitions in the table (O): 5}}
{{Total No. of new partitions in the table (N): 5}}
{{Total WorkUnits created (W) : 25 ( 2 x (O+N) + N )}}
{{CopyableFile WorkUnits: 5 (one for each newly found partition)}}
{{PostPublishStep WorkUnits: 20 ({color:#de350b}two for every partition in the entire table, not just for new partitions!{color})}}
{quote}
h5. Step 3)

a) At the source, add the missing middle partition (2020-12-31), with some rows in it
{code:sql}
hive> show partitions tc_p5_r10;
 OK
 dt=2020-12-26
 dt=2020-12-27
 dt=2020-12-28
 dt=2020-12-29
 dt=2020-12-30
 dt=2020-12-31
 dt=2021-01-01
 dt=2021-01-02
 dt=2021-01-03
 dt=2021-01-04
 dt=2021-01-05
 Time taken: 0.101 seconds, Fetched: 11 row(s)
{code}
b) Do DataMovement with the Job configuration below
c) Observations :
{quote}{{Total No. of old partitions in the table (O): 10}}
{{Total No. of new partitions in the table (N): 1}}
{{*Total WorkUnits created (W): 23 ( 2 x (O+N) + N )*}}
{{CopyableFile WorkUnits: 1 (one for the newly found partition)}}
{{PostPublishStep WorkUnits: 22 ({color:#de350b}two for every partition in the entire table, not just for the new partition!{color})}}
{quote}
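Across the three steps, the observed totals fit the pattern W = 2 x (O + N) + N: one {{CopyableFile}} per new partition plus two {{PostPublishSteps}} per partition in the entire table. A minimal sketch of that arithmetic (a hypothetical model for illustration, not Gobblin code):
{code:java}
// Hypothetical model of the observed work-unit counts; not Gobblin code.
public class WorkUnitCountModel {
    // O = partitions already present at the target, N = newly found partitions
    static int observedWorkUnits(int o, int n) {
        int copyableFiles = n;              // one CopyableFile per new partition
        int postPublish = 2 * (o + n);      // two PostPublishSteps per partition in the whole table
        return copyableFiles + postPublish; // W = 2 x (O + N) + N
    }

    public static void main(String[] args) {
        System.out.println(observedWorkUnits(0, 5));  // Step 1: prints 15
        System.out.println(observedWorkUnits(5, 5));  // Step 2: prints 25
        System.out.println(observedWorkUnits(10, 1)); // Step 3: prints 23
    }
}
{code}
The expected behavior would instead be on the order of N {{PostPublishSteps}} (one CREATE TABLE plus ADD PARTITIONS only for new partitions), independent of O.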
 

 
h4. +Job Configuration used:+
{code:java}
job.name=LocalHive2LocalHive-tc_db-tc_p5_r10-*
job.description=Test Gobblin job for copy
# target location for copy
data.publisher.final.dir=/tmp/hive/tc_db_1_copy/tc_p5_r10/data
gobblin.dataset.profile.class=org.apache.gobblin.data.management.copy.hive.HiveDatasetFinder
source.filebased.fs.uri="hdfs://localhost:8020"
hive.dataset.hive.metastore.uri=thrift://localhost:9083
hive.dataset.copy.target.table.root=${data.publisher.final.dir}
hive.dataset.copy.target.metastore.uri=thrift://localhost:9083
hive.dataset.copy.target.database=tc_db_copy_1
hive.db.root.dir=${data.publisher.final.dir}
# writer.fs.uri="hdfs://127.0.0.1:8020/"
hive.dataset.whitelist=tc_db.tc_p5_r10
gobblin.copy.recursive.update=true
# ====================================================================
# Distcp configurations (do not change)
# ====================================================================
type=hadoopJava
job.class=org.apache.gobblin.azkaban.AzkabanJobLauncher
extract.namespace=org.apache.gobblin.copy
data.publisher.type=org.apache.gobblin.data.management.copy.publisher.CopyDataPublisher
source.class=org.apache.gobblin.data.management.copy.CopySource
writer.builder.class=org.apache.gobblin.data.management.copy.writer.FileAwareInputStreamDataWriterBuilder
converter.classes=org.apache.gobblin.converter.IdentityConverter
task.maxretries=0
workunit.retry.enabled=false
distcp.persist.dir=/tmp/distcp-persist-dir
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)