Posted to dev@gobblin.apache.org by "Sridivakar (Jira)" <ji...@apache.org> on 2021/02/25 18:07:00 UTC
[jira] [Updated] (GOBBLIN-1395) Spurious PostPublishStep WorkUnits for CREATE TABLE/ADD PARTITIONs for HIVE table copy
[ https://issues.apache.org/jira/browse/GOBBLIN-1395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sridivakar updated GOBBLIN-1395:
--------------------------------
Description:
For Hive copy, Gobblin creates spurious {{PostPublishSteps}} for Hive registrations:
# It creates too many {{PostPublishSteps}} with CREATE TABLE: for a source table with P partitions, it is observed to create a total of P such {{PostPublishSteps}}.
# It creates {{PostPublishSteps}} with ADD PARTITIONS even for partitions that are already present at the target, although it does not create CopyableFile work units for those partitions.
*Steps to reproduce:*
h5. Step 1)
a) create a table with 5 partitions, with some rows in each partition
{code:sql}
hive> show partitions tc_p5_r10;
OK
dt=2020-12-26
dt=2020-12-27
dt=2020-12-28
dt=2020-12-29
dt=2020-12-30
Time taken: 1.287 seconds, Fetched: 5 row(s)
{code}
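For reference, a minimal sketch of how such a table could be created; the column layout and row values are assumptions (only the table name and the partition column {{dt}} appear in the listing above):
{code:sql}
-- Hypothetical columns; only the table/partition names are taken from the listing above.
CREATE TABLE tc_db.tc_p5_r10 (id INT, val STRING)
PARTITIONED BY (dt STRING);

-- One INSERT per partition puts some rows in each; repeat for dt='2020-12-27' .. dt='2020-12-30'.
INSERT INTO TABLE tc_db.tc_p5_r10 PARTITION (dt='2020-12-26')
VALUES (1, 'a'), (2, 'b');
{code}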
b) Run DataMovement with the job configuration listed below
c) Observations:
{quote}{{Total No. of old partitions in the table (O): 0}}
{{Total No. of new partitions in the table (N): 5}}
{{*Total WorkUnits created (W): 15 ( 2 x (O+N) + N )*}}
{{CopyableFile WorkUnits: 5 (one for each partition)}}
{{PostPublishStep WorkUnits: 10 ({color:#de350b}two for each partition in the table, out of the two: one for publishing table metadata; another for publishing partition metadata{color})}}
{quote}
h5. Step 2)
a) add 5 more partitions, with some rows in each partition
{code:sql}
hive> show partitions tc_p5_r10;
OK
dt=2020-12-26
dt=2020-12-27
dt=2020-12-28
dt=2020-12-29
dt=2020-12-30
dt=2021-01-01
dt=2021-01-02
dt=2021-01-03
dt=2021-01-04
dt=2021-01-05
Time taken: 0.131 seconds, Fetched: 10 row(s)
{code}
_Note: the partition for 31st Dec is intentionally left out; it is added in step (3)_
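The five additional partitions for step 2 can be loaded the same way (again, the row values are illustrative):
{code:sql}
-- Repeat for dt='2021-01-02' .. dt='2021-01-05'.
INSERT INTO TABLE tc_db.tc_p5_r10 PARTITION (dt='2021-01-01')
VALUES (3, 'c'), (4, 'd');
{code}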
b) Run DataMovement with the job configuration listed below
c) Observations:
{quote}{{Total No. of old partitions in the table (O): 5}}
{{Total No. of new partitions in the table (N): 5}}
{{Total WorkUnits created (W): 25 ( 2 x (O+N) + N )}}
{{CopyableFile WorkUnits: 5 (one for each newly found partition)}}
{{PostPublishStep WorkUnits: 20 ({color:#de350b}two for every partition in the entire table, not just for new partitions!{color})}}
{quote}
h5. Step 3)
a) At the source, add the missing middle partition (2020-12-31), with some rows in it
{code:sql}
hive> show partitions tc_p5_r10;
OK
dt=2020-12-26
dt=2020-12-27
dt=2020-12-28
dt=2020-12-29
dt=2020-12-30
dt=2020-12-31
dt=2021-01-01
dt=2021-01-02
dt=2021-01-03
dt=2021-01-04
dt=2021-01-05
Time taken: 0.101 seconds, Fetched: 11 row(s)
{code}
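The back-filled partition for step 3 can be added the same way (row values illustrative):
{code:sql}
INSERT INTO TABLE tc_db.tc_p5_r10 PARTITION (dt='2020-12-31')
VALUES (5, 'e'), (6, 'f');
{code}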
b) Run DataMovement with the job configuration listed below
c) Observations:
{quote}{{Total No. of old partitions in the table (O): 10}}
{{Total No. of new partitions in the table (N): 1}}
{{*Total WorkUnits created (W): 23 ( 2 x (O+N) + N )*}}
{{CopyableFile WorkUnits: 1 (one for each newly found partition)}}
{{PostPublishStep WorkUnits: 22 ({color:#de350b}two for every partition in the entire table, not just for the new partition!{color})}}
{quote}
h4. +Job Configuration used:+
{code:none}
job.name=LocalHive2LocalHive-tc_db-tc_p5_r10-*
job.description=Test Gobblin job for copy
# target location for copy
data.publisher.final.dir=/tmp/hive/tc_db_1_copy/tc_p5_r10/data
gobblin.dataset.profile.class=org.apache.gobblin.data.management.copy.hive.HiveDatasetFinder
source.filebased.fs.uri="hdfs://localhost:8020"
hive.dataset.hive.metastore.uri=thrift://localhost:9083
hive.dataset.copy.target.table.root=${data.publisher.final.dir}
hive.dataset.copy.target.metastore.uri=thrift://localhost:9083
hive.dataset.copy.target.database=tc_db_copy_1
hive.db.root.dir=${data.publisher.final.dir}
# writer.fs.uri="hdfs://127.0.0.1:8020/"
hive.dataset.whitelist=tc_db.tc_p5_r10
gobblin.copy.recursive.update=true
# ====================================================================
# Distcp configurations (do not change)
# ====================================================================
type=hadoopJava
job.class=org.apache.gobblin.azkaban.AzkabanJobLauncher
extract.namespace=org.apache.gobblin.copy
data.publisher.type=org.apache.gobblin.data.management.copy.publisher.CopyDataPublisher
source.class=org.apache.gobblin.data.management.copy.CopySource
writer.builder.class=org.apache.gobblin.data.management.copy.writer.FileAwareInputStreamDataWriterBuilder
converter.classes=org.apache.gobblin.converter.IdentityConverter
task.maxretries=0
workunit.retry.enabled=false
distcp.persist.dir=/tmp/distcp-persist-dir
{code}
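To inspect the result of each run, the registered partitions on the copy can be listed in the target database from the configuration above ({{hive.dataset.copy.target.database=tc_db_copy_1}}); that the target table keeps the source table name is an assumption:
{code:sql}
-- The target metastore is the same local one (thrift://localhost:9083).
USE tc_db_copy_1;
SHOW PARTITIONS tc_p5_r10;  -- assumed target table name
{code}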
> Spurious PostPublishStep WorkUnits for CREATE TABLE/ADD PARTITIONs for HIVE table copy
> --------------------------------------------------------------------------------------
>
> Key: GOBBLIN-1395
> URL: https://issues.apache.org/jira/browse/GOBBLIN-1395
> Project: Apache Gobblin
> Issue Type: Bug
> Components: hive-registration
> Affects Versions: 0.15.0
> Reporter: Sridivakar
> Assignee: Abhishek Tiwari
> Priority: Minor