Posted to dev@gobblin.apache.org by "Sridivakar (Jira)" <ji...@apache.org> on 2021/02/25 18:07:00 UTC
[jira] [Updated] (GOBBLIN-1395) Spurious PostPublishStep WorkUnits for CREATE TABLE/ADD PARTITIONs for HIVE table copy
[ https://issues.apache.org/jira/browse/GOBBLIN-1395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sridivakar updated GOBBLIN-1395:
--------------------------------
Description:
For Hive copy, Gobblin creates spurious {{PostPublishSteps}} for Hive registrations:
# It creates too many {{PostPublishSteps}} with CREATE TABLE: for a source table with P partitions, it is observed to create a total of P such {{PostPublishSteps}}.
# It creates {{PostPublishSteps}} with ADD PARTITIONS even for partitions that are already present at the target, although it does not create CopyableFile work units for those partitions.
*Steps to reproduce:*
h5. Step 1)
a) create a table with 5 partitions, with some rows in each partition
{code:sql}
hive> show partitions tc_p5_r10;
OK
dt=2020-12-26
dt=2020-12-27
dt=2020-12-28
dt=2020-12-29
dt=2020-12-30
Time taken: 1.287 seconds, Fetched: 5 row(s)
{code}
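For reference, a minimal sketch of how such a table could be created; the column layout and row values are assumptions (only the table name and the partition column {{dt}} appear in the listing above):
{code:sql}
-- Hypothetical columns; only the table/partition names are taken from the listing above.
CREATE TABLE tc_db.tc_p5_r10 (id INT, val STRING)
PARTITIONED BY (dt STRING);

-- One INSERT per partition puts some rows in each; repeat for dt='2020-12-27' .. dt='2020-12-30'.
INSERT INTO TABLE tc_db.tc_p5_r10 PARTITION (dt='2020-12-26')
VALUES (1, 'a'), (2, 'b');
{code}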
b) Run DataMovement with the job configuration listed below
c) Observations:
{quote}{{Total No. of old partitions in the table (O): 0}}
{{Total No. of new partitions in the table (N): 5}}
{{*Total WorkUnits created (W): 15 ( 2 x (O+N) + N )*}}
{{CopyableFile WorkUnits: 5 (one for each partition)}}
{{PostPublishStep WorkUnits: 10 ({color:#de350b}two for each partition in the table, out of the two: one for publishing table metadata; another for publishing partition metadata{color})}}
{quote}
h5. Step 2)
a) add 5 more partitions, with some rows in each partition
{code:sql}
hive> show partitions tc_p5_r10;
OK
dt=2020-12-26
dt=2020-12-27
dt=2020-12-28
dt=2020-12-29
dt=2020-12-30
dt=2021-01-01
dt=2021-01-02
dt=2021-01-03
dt=2021-01-04
dt=2021-01-05
Time taken: 0.131 seconds, Fetched: 10 row(s)
{code}
_Note: the partition for 31st Dec is intentionally left out; it is added in step (3)_
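The five additional partitions for step 2 can be loaded the same way (again, the row values are illustrative):
{code:sql}
-- Repeat for dt='2021-01-02' .. dt='2021-01-05'.
INSERT INTO TABLE tc_db.tc_p5_r10 PARTITION (dt='2021-01-01')
VALUES (3, 'c'), (4, 'd');
{code}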
b) Run DataMovement with the job configuration listed below
c) Observations:
{quote}{{Total No. of old partitions in the table (O): 5}}
{{Total No. of new partitions in the table (N): 5}}
{{Total WorkUnits created (W): 25 ( 2 x (O+N) + N )}}
{{CopyableFile WorkUnits: 5 (one for each newly found partition)}}
{{PostPublishStep WorkUnits: 20 ({color:#de350b}two for every partition in the entire table, not just for new partitions!{color})}}
{quote}
h5. Step 3)
a) At the source, add the missing middle partition (2020-12-31), with some rows in it
{code:sql}
hive> show partitions tc_p5_r10;
OK
dt=2020-12-26
dt=2020-12-27
dt=2020-12-28
dt=2020-12-29
dt=2020-12-30
dt=2020-12-31
dt=2021-01-01
dt=2021-01-02
dt=2021-01-03
dt=2021-01-04
dt=2021-01-05
Time taken: 0.101 seconds, Fetched: 11 row(s)
{code}
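The back-filled partition for step 3 can be added the same way (row values illustrative):
{code:sql}
INSERT INTO TABLE tc_db.tc_p5_r10 PARTITION (dt='2020-12-31')
VALUES (5, 'e'), (6, 'f');
{code}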
b) Run DataMovement with the job configuration listed below
c) Observations:
{quote}{{Total No. of old partitions in the table (O): 10}}
{{Total No. of new partitions in the table (N): 1}}
{{*Total WorkUnits created (W): 23 ( 2 x (O+N) + N )*}}
{{CopyableFile WorkUnits: 1 (one for each newly found partition)}}
{{PostPublishStep WorkUnits: 22 ({color:#de350b}two for every partition in the entire table, not just for the new partition!{color})}}
{quote}
h4. +Job Configuration used:+
{code:none}
job.name=LocalHive2LocalHive-tc_db-tc_p5_r10-*
job.description=Test Gobblin job for copy
# target location for copy
data.publisher.final.dir=/tmp/hive/tc_db_1_copy/tc_p5_r10/data
gobblin.dataset.profile.class=org.apache.gobblin.data.management.copy.hive.HiveDatasetFinder
source.filebased.fs.uri="hdfs://localhost:8020"
hive.dataset.hive.metastore.uri=thrift://localhost:9083
hive.dataset.copy.target.table.root=${data.publisher.final.dir}
hive.dataset.copy.target.metastore.uri=thrift://localhost:9083
hive.dataset.copy.target.database=tc_db_copy_1
hive.db.root.dir=${data.publisher.final.dir}
# writer.fs.uri="hdfs://127.0.0.1:8020/"
hive.dataset.whitelist=tc_db.tc_p5_r10
gobblin.copy.recursive.update=true
# ====================================================================
# Distcp configurations (do not change)
# ====================================================================
type=hadoopJava
job.class=org.apache.gobblin.azkaban.AzkabanJobLauncher
extract.namespace=org.apache.gobblin.copy
data.publisher.type=org.apache.gobblin.data.management.copy.publisher.CopyDataPublisher
source.class=org.apache.gobblin.data.management.copy.CopySource
writer.builder.class=org.apache.gobblin.data.management.copy.writer.FileAwareInputStreamDataWriterBuilder
converter.classes=org.apache.gobblin.converter.IdentityConverter
task.maxretries=0
workunit.retry.enabled=false
distcp.persist.dir=/tmp/distcp-persist-dir
{code}
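To inspect the result of each run, the registered partitions on the copy can be listed in the target database from the configuration above ({{hive.dataset.copy.target.database=tc_db_copy_1}}); that the target table keeps the source table name is an assumption:
{code:sql}
-- The target metastore is the same local one (thrift://localhost:9083).
USE tc_db_copy_1;
SHOW PARTITIONS tc_p5_r10;  -- assumed target table name
{code}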
> Spurious PostPublishStep WorkUnits for CREATE TABLE/ADD PARTITIONs for HIVE table copy
> --------------------------------------------------------------------------------------
>
> Key: GOBBLIN-1395
> URL: https://issues.apache.org/jira/browse/GOBBLIN-1395
> Project: Apache Gobblin
> Issue Type: Bug
> Components: hive-registration
> Affects Versions: 0.15.0
> Reporter: Sridivakar
> Assignee: Abhishek Tiwari
> Priority: Minor