Posted to issues@hive.apache.org by "Hive QA (Jira)" <ji...@apache.org> on 2019/12/05 02:40:00 UTC

[jira] [Commented] (HIVE-22579) ACID v1: covered delta-only splits (without base) should be marked as covered (branch-2)

    [ https://issues.apache.org/jira/browse/HIVE-22579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16988399#comment-16988399 ] 

Hive QA commented on HIVE-22579:
--------------------------------



Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12987446/HIVE-22579.01.branch-2.patch

{color:red}ERROR:{color} -1 due to build exiting with an error

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/19746/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/19746/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-19746/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Tests exited with: NonZeroExitCodeException
Command 'bash /data/hiveptest/working/scratch/source-prep.sh' failed with exit status 1 and output '+ date '+%Y-%m-%d %T.%3N'
2019-12-05 02:34:47.581
+ [[ -n /usr/lib/jvm/java-8-openjdk-amd64 ]]
+ export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
+ JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
+ export PATH=/usr/lib/jvm/java-8-openjdk-amd64/bin/:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games
+ PATH=/usr/lib/jvm/java-8-openjdk-amd64/bin/:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games
+ export 'ANT_OPTS=-Xmx1g -XX:MaxPermSize=256m '
+ ANT_OPTS='-Xmx1g -XX:MaxPermSize=256m '
+ export 'MAVEN_OPTS=-Xmx1g '
+ MAVEN_OPTS='-Xmx1g '
+ cd /data/hiveptest/working/
+ tee /data/hiveptest/logs/PreCommit-HIVE-Build-19746/source-prep.txt
+ [[ false == \t\r\u\e ]]
+ mkdir -p maven ivy
+ [[ git = \s\v\n ]]
+ [[ git = \g\i\t ]]
+ [[ -z branch-2 ]]
+ [[ -d apache-github-branch-2-source ]]
+ [[ ! -d apache-github-branch-2-source/.git ]]
+ [[ ! -d apache-github-branch-2-source ]]
+ date '+%Y-%m-%d %T.%3N'
2019-12-05 02:34:47.679
+ cd apache-github-branch-2-source
+ git fetch origin
From https://github.com/apache/hive
   7534f82..de0a7ec  branch-1   -> origin/branch-1
   9bcdb54..6002c51  branch-1.2 -> origin/branch-1.2
   292a98f..0359921  branch-2.1 -> origin/branch-2.1
   b148507..67f9139  branch-2.2 -> origin/branch-2.2
   f90975a..f55ee60  branch-3   -> origin/branch-3
   a354bed..0ecbd12  branch-3.0 -> origin/branch-3.0
   909c1dc..eb4d7c3  branch-3.1 -> origin/branch-3.1
   305e710..1ef05ef  master     -> origin/master
   e59fdf9..3638231  storage-branch-2.7 -> origin/storage-branch-2.7
 * [new tag]         rel/storage-release-2.7.1 -> rel/storage-release-2.7.1
+ git reset --hard HEAD
HEAD is now at a4a6101 HIVE-22249: Support Parquet through HCatalog (Jay Green-Stevens via Peter Vary)
+ git clean -f -d
+ git checkout branch-2
Already on 'branch-2'
Your branch is up-to-date with 'origin/branch-2'.
+ git reset --hard origin/branch-2
HEAD is now at a4a6101 HIVE-22249: Support Parquet through HCatalog (Jay Green-Stevens via Peter Vary)
+ git merge --ff-only origin/branch-2
Already up-to-date.
+ date '+%Y-%m-%d %T.%3N'
2019-12-05 02:35:11.141
+ rm -rf ../yetus_PreCommit-HIVE-Build-19746
+ mkdir ../yetus_PreCommit-HIVE-Build-19746
+ git gc
+ cp -R . ../yetus_PreCommit-HIVE-Build-19746
+ mkdir /data/hiveptest/logs/PreCommit-HIVE-Build-19746/yetus
+ patchCommandPath=/data/hiveptest/working/scratch/smart-apply-patch.sh
+ patchFilePath=/data/hiveptest/working/scratch/build.patch
+ [[ -f /data/hiveptest/working/scratch/build.patch ]]
+ chmod +x /data/hiveptest/working/scratch/smart-apply-patch.sh
+ /data/hiveptest/working/scratch/smart-apply-patch.sh /data/hiveptest/working/scratch/build.patch
Going to apply patch with: git apply -p0
+ [[ maven == \m\a\v\e\n ]]
+ rm -rf /data/hiveptest/working/maven/org/apache/hive
+ mvn -B clean install -DskipTests -T 4 -q -Dmaven.repo.local=/data/hiveptest/working/maven
ANTLR Parser Generator  Version 3.5.2
Output file /data/hiveptest/working/apache-github-branch-2-source/metastore/target/generated-sources/antlr3/org/apache/hadoop/hive/metastore/parser/FilterParser.java does not exist: must build /data/hiveptest/working/apache-github-branch-2-source/metastore/src/java/org/apache/hadoop/hive/metastore/parser/Filter.g
org/apache/hadoop/hive/metastore/parser/Filter.g
DataNucleus Enhancer (version 4.1.17) for API "JDO"
DataNucleus Enhancer : Classpath
>>  /usr/share/maven/boot/plexus-classworlds-2.x.jar
ENHANCED (Persistable) : org.apache.hadoop.hive.metastore.model.MDatabase
ENHANCED (Persistable) : org.apache.hadoop.hive.metastore.model.MFieldSchema
ENHANCED (Persistable) : org.apache.hadoop.hive.metastore.model.MType
ENHANCED (Persistable) : org.apache.hadoop.hive.metastore.model.MTable
ENHANCED (Persistable) : org.apache.hadoop.hive.metastore.model.MConstraint
ENHANCED (Persistable) : org.apache.hadoop.hive.metastore.model.MSerDeInfo
ENHANCED (Persistable) : org.apache.hadoop.hive.metastore.model.MOrder
ENHANCED (Persistable) : org.apache.hadoop.hive.metastore.model.MColumnDescriptor
ENHANCED (Persistable) : org.apache.hadoop.hive.metastore.model.MStringList
ENHANCED (Persistable) : org.apache.hadoop.hive.metastore.model.MStorageDescriptor
ENHANCED (Persistable) : org.apache.hadoop.hive.metastore.model.MPartition
ENHANCED (Persistable) : org.apache.hadoop.hive.metastore.model.MIndex
ENHANCED (Persistable) : org.apache.hadoop.hive.metastore.model.MRole
ENHANCED (Persistable) : org.apache.hadoop.hive.metastore.model.MRoleMap
ENHANCED (Persistable) : org.apache.hadoop.hive.metastore.model.MGlobalPrivilege
ENHANCED (Persistable) : org.apache.hadoop.hive.metastore.model.MDBPrivilege
ENHANCED (Persistable) : org.apache.hadoop.hive.metastore.model.MTablePrivilege
ENHANCED (Persistable) : org.apache.hadoop.hive.metastore.model.MPartitionPrivilege
ENHANCED (Persistable) : org.apache.hadoop.hive.metastore.model.MTableColumnPrivilege
ENHANCED (Persistable) : org.apache.hadoop.hive.metastore.model.MPartitionColumnPrivilege
ENHANCED (Persistable) : org.apache.hadoop.hive.metastore.model.MPartitionEvent
ENHANCED (Persistable) : org.apache.hadoop.hive.metastore.model.MMasterKey
ENHANCED (Persistable) : org.apache.hadoop.hive.metastore.model.MDelegationToken
ENHANCED (Persistable) : org.apache.hadoop.hive.metastore.model.MTableColumnStatistics
ENHANCED (Persistable) : org.apache.hadoop.hive.metastore.model.MPartitionColumnStatistics
ENHANCED (Persistable) : org.apache.hadoop.hive.metastore.model.MVersionTable
ENHANCED (Persistable) : org.apache.hadoop.hive.metastore.model.MResourceUri
ENHANCED (Persistable) : org.apache.hadoop.hive.metastore.model.MFunction
ENHANCED (Persistable) : org.apache.hadoop.hive.metastore.model.MNotificationLog
ENHANCED (Persistable) : org.apache.hadoop.hive.metastore.model.MNotificationNextId
DataNucleus Enhancer completed with success for 30 classes. Timings : input=186 ms, enhance=197 ms, total=383 ms. Consult the log for full details
ANTLR Parser Generator  Version 3.5.2
Output file /data/hiveptest/working/apache-github-branch-2-source/ql/target/generated-sources/antlr3/org/apache/hadoop/hive/ql/parse/HiveLexer.java does not exist: must build /data/hiveptest/working/apache-github-branch-2-source/ql/src/java/org/apache/hadoop/hive/ql/parse/HiveLexer.g
org/apache/hadoop/hive/ql/parse/HiveLexer.g
Output file /data/hiveptest/working/apache-github-branch-2-source/ql/target/generated-sources/antlr3/org/apache/hadoop/hive/ql/parse/HiveParser.java does not exist: must build /data/hiveptest/working/apache-github-branch-2-source/ql/src/java/org/apache/hadoop/hive/ql/parse/HiveParser.g
org/apache/hadoop/hive/ql/parse/HiveParser.g
Output file /data/hiveptest/working/apache-github-branch-2-source/ql/target/generated-sources/antlr3/org/apache/hadoop/hive/ql/parse/HintParser.java does not exist: must build /data/hiveptest/working/apache-github-branch-2-source/ql/src/java/org/apache/hadoop/hive/ql/parse/HintParser.g
org/apache/hadoop/hive/ql/parse/HintParser.g
Generating vector expression code
Generating vector expression test code
[ERROR] Failed to execute goal on project hive-hbase-handler: Could not resolve dependencies for project org.apache.hive:hive-hbase-handler:jar:2.4.0-SNAPSHOT: Could not find artifact org.apache.hbase:hbase-procedure:jar:1.1.1 -> [Help 1]
[ERROR] 
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR] 
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException
[ERROR] 
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR]   mvn <goals> -rf :hive-hbase-handler
+ result=1
+ '[' 1 -ne 0 ']'
+ rm -rf yetus_PreCommit-HIVE-Build-19746
+ exit 1
'
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12987446 - PreCommit-HIVE-Build

> ACID v1: covered delta-only splits (without base) should be marked as covered (branch-2)
> ----------------------------------------------------------------------------------------
>
>                 Key: HIVE-22579
>                 URL: https://issues.apache.org/jira/browse/HIVE-22579
>             Project: Hive
>          Issue Type: Bug
>            Reporter: László Bodor
>            Assignee: László Bodor
>            Priority: Major
>         Attachments: HIVE-22579.01.branch-2.patch
>
>
> There is a scenario in which different SplitGenerator instances try to cover the delta-only buckets (those with no base file) more than once. Multiple OrcSplit instances can then be generated for the same delta file, so multiple tasks read the same delta file, producing duplicate records in a simple select-star query.
> File structure for a 256-bucket table:
> {code}
> drwxrwxrwx   - hive hadoop          0 2019-11-29 15:55 /apps/hive/warehouse/naresh.db/test1/base_0000013
> -rw-r--r--   3 hive hadoop        353 2019-11-29 15:55 /apps/hive/warehouse/naresh.db/test1/base_0000013/bucket_00012
> -rw-r--r--   3 hive hadoop       1642 2019-11-29 15:55 /apps/hive/warehouse/naresh.db/test1/base_0000013/bucket_00140
> drwxrwxrwx   - hive hadoop          0 2019-11-29 15:55 /apps/hive/warehouse/naresh.db/test1/delta_0000014_0000014_0000
> -rwxrwxrwx   3 hive hadoop        348 2019-11-29 15:55 /apps/hive/warehouse/naresh.db/test1/delta_0000014_0000014_0000/bucket_00012
> -rwxrwxrwx   3 hive hadoop       1635 2019-11-29 15:55 /apps/hive/warehouse/naresh.db/test1/delta_0000014_0000014_0000/bucket_00140
> drwxrwxrwx   - hive hadoop          0 2019-11-29 16:04 /apps/hive/warehouse/naresh.db/test1/delta_0000015_0000015_0000
> -rwxrwxrwx   3 hive hadoop        348 2019-11-29 16:04 /apps/hive/warehouse/naresh.db/test1/delta_0000015_0000015_0000/bucket_00012
> -rwxrwxrwx   3 hive hadoop       1808 2019-11-29 16:04 /apps/hive/warehouse/naresh.db/test1/delta_0000015_0000015_0000/bucket_00140
> drwxrwxrwx   - hive hadoop          0 2019-11-29 16:06 /apps/hive/warehouse/naresh.db/test1/delta_0000016_0000016_0000
> -rwxrwxrwx   3 hive hadoop        348 2019-11-29 16:06 /apps/hive/warehouse/naresh.db/test1/delta_0000016_0000016_0000/bucket_00043
> -rwxrwxrwx   3 hive hadoop       1633 2019-11-29 16:06 /apps/hive/warehouse/naresh.db/test1/delta_0000016_0000016_0000/bucket_00171
> {code}
> In this case, where the bucket_00171 file has a record but no corresponding base file, a select * with the ETL split strategy can generate 2 splits for the same delta bucket.
> The scenario of the issue:
> 1. ETLSplitStrategy contains a [covered[] array|https://github.com/apache/hive/blob/298f749ec7be04abb797fb119f3f0b923c8a1b27/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java#L763] which is [shared between the SplitInfo instances|https://github.com/apache/hive/blob/298f749ec7be04abb797fb119f3f0b923c8a1b27/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java#L824] to be created
> 2. a SplitInfo instance is created for [every base file (2 in this case)|https://github.com/apache/hive/blob/298f749ec7be04abb797fb119f3f0b923c8a1b27/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java#L809]
> 3. for every SplitInfo, [a SplitGenerator is created|https://github.com/apache/hive/blob/298f749ec7be04abb797fb119f3f0b923c8a1b27/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java#L925-L926], and in the constructor, [parent's getSplit is called|https://github.com/apache/hive/blob/298f749ec7be04abb797fb119f3f0b923c8a1b27/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java#L1251], which tries to take care of the deltas
> I'm not sure at the moment what the intention of this is, but this way a duplicated delta split can be generated, which can cause a duplicated read later (note that both tasks read the same delta file: bucket_00171):
> {code}
> 2019-12-01T16:24:53,669  INFO [TezTR-127843_16_30_0_171_0 (1575040127843_0016_30_00_000171_0)] orc.ReaderImpl: Reading ORC rows from hdfs://c3351-node2.squadron.support.hortonworks.com:8020/apps/hive/warehouse/naresh.db/test1/delta_0000016_0000016_0000/bucket_00171 with {include: [true, true, true, true, true, true, true, true, true, true, true, true], offset: 0, length: 9223372036854775807, schema: struct<idp_warehouse_id:bigint,idp_audit_id:bigint,batch_id:decimal(9,0),source_system_cd:varchar(500),insert_time:timestamp,process_status_cd:varchar(20),business_date:date,last_update_time:timestamp,report_date:date,etl_run_time:timestamp,etl_run_nbr:bigint>}
> 2019-12-01T16:24:53,672  INFO [TezTR-127843_16_30_0_171_0 (1575040127843_0016_30_00_000171_0)] lib.MRReaderMapred: Processing split: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat:OrcSplit [hdfs://c3351-node2.squadron.support.hortonworks.com:8020/apps/hive/warehouse/naresh.db/test1, start=171, length=0, isOriginal=false, fileLength=9223372036854775807, hasFooter=false, hasBase=false, deltas=[{ minTxnId: 14 maxTxnId: 14 stmtIds: [0] }, { minTxnId: 15 maxTxnId: 15 stmtIds: [0] }, { minTxnId: 16 maxTxnId: 16 stmtIds: [0] }]]
> 2019-12-01T16:24:55,807  INFO [TezTR-127843_16_30_0_425_0 (1575040127843_0016_30_00_000425_0)] orc.ReaderImpl: Reading ORC rows from hdfs://c3351-node2.squadron.support.hortonworks.com:8020/apps/hive/warehouse/naresh.db/test1/delta_0000016_0000016_0000/bucket_00171 with {include: [true, true, true, true, true, true, true, true, true, true, true, true], offset: 0, length: 9223372036854775807, schema: struct<idp_warehouse_id:bigint,idp_audit_id:bigint,batch_id:decimal(9,0),source_system_cd:varchar(500),insert_time:timestamp,process_status_cd:varchar(20),business_date:date,last_update_time:timestamp,report_date:date,etl_run_time:timestamp,etl_run_nbr:bigint>}
> 2019-12-01T16:24:55,813  INFO [TezTR-127843_16_30_0_425_0 (1575040127843_0016_30_00_000425_0)] lib.MRReaderMapred: Processing split: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat:OrcSplit [hdfs://c3351-node2.squadron.support.hortonworks.com:8020/apps/hive/warehouse/naresh.db/test1, start=171, length=0, isOriginal=false, fileLength=9223372036854775807, hasFooter=false, hasBase=false, deltas=[{ minTxnId: 14 maxTxnId: 14 stmtIds: [0] }, { minTxnId: 15 maxTxnId: 15 stmtIds: [0] }, { minTxnId: 16 maxTxnId: 16 stmtIds: [0] }]]
> {code}
> It seems this issue doesn't affect ACID v2: there getSplits() returns an empty collection, or throws an exception in case of unexpected deltas (here the deltas were expected, so the empty-collection path applies):
> https://github.com/apache/hive/blob/8ee3497f87f81fa84ee1023e891dc54087c2cd5e/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java#L1178-L1197
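The covered[] bookkeeping described in the steps above can be illustrated with a toy model (hypothetical class and method names, not the real OrcInputFormat code): one "generator" runs per base file, and unless delta-only buckets are marked as covered in the shared array, every generator emits its own split for the same delta bucket.

```java
// Toy model of the split-generation issue. All names here are hypothetical
// illustrations, not the actual Hive OrcInputFormat API.
import java.util.ArrayList;
import java.util.List;

public class CoveredDeltaSketch {

    // baseBuckets: bucket ids that have a base file (e.g. 12, 140).
    // deltaOnlyBuckets: bucket ids present only in deltas (e.g. 43, 171).
    // markCovered: whether a generator records the delta bucket as covered
    // in the shared array (the behavior the patch is meant to ensure).
    static List<String> deltaSplits(int[] baseBuckets, int[] deltaOnlyBuckets,
                                    boolean markCovered) {
        boolean[] covered = new boolean[256];   // shared across all generators
        List<String> splits = new ArrayList<>();
        for (int base : baseBuckets) {          // one SplitGenerator per base file
            for (int b : deltaOnlyBuckets) {
                if (!covered[b]) {
                    splits.add("delta-split for bucket_" + b);
                    if (markCovered) {
                        covered[b] = true;      // mark so later generators skip it
                    }
                }
            }
        }
        return splits;
    }

    public static void main(String[] args) {
        int[] bases = {12, 140};
        int[] deltaOnly = {43, 171};
        // Without marking: each of the 2 generators emits 2 splits -> 4 total.
        System.out.println("without fix: " + deltaSplits(bases, deltaOnly, false).size());
        // With marking: the first generator covers both buckets -> 2 total.
        System.out.println("with fix:    " + deltaSplits(bases, deltaOnly, true).size());
    }
}
```

With markCovered=false the model produces 4 delta splits for 2 delta-only buckets (each emitted once per base file), mirroring the duplicate bucket_00171 reads in the log above; with markCovered=true it produces the expected 2.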



--
This message was sent by Atlassian Jira
(v8.3.4#803005)