You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hawq.apache.org by "Fenggang Wang (JIRA)" <ji...@apache.org> on 2018/12/21 06:30:00 UTC

[jira] [Commented] (HAWQ-1094) Select on INTERNAL table returns wrong results when hdfs blocks have checksum errors

    [ https://issues.apache.org/jira/browse/HAWQ-1094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16726471#comment-16726471 ] 

Fenggang Wang commented on HAWQ-1094:
-------------------------------------

Hello [~mli],  we have a customer who is still running hawq version 2.0.0. They have met this very issue in their environment.
I have already suggested them to upgrade hawq to newest version with fix on this issue.
https://github.com/apache/hawq/commit/5f3db03a7b52c4b14575a09f8af90e8a6be69192

On the other hand, they still have questions about this issue.  These two questions are just from your description.

Questions:

1. When we detect a bad checksum on the local block, why do we not mark the block as corrupt with the NameNode?
2. When we read the block using RemoteBlockReader, why doesn't it detect the bad block?

And they would rather add another question
3. Why fsck's output is healthy, but checksum return value from libhdfs3 is chechsum error.

Appreciate if you could shed some light on these questions.

thanks
WangFG

> Select on INTERNAL table returns wrong results when hdfs blocks have checksum errors
> ------------------------------------------------------------------------------------
>
>                 Key: HAWQ-1094
>                 URL: https://issues.apache.org/jira/browse/HAWQ-1094
>             Project: Apache HAWQ
>          Issue Type: Bug
>          Components: Fault Tolerance
>            Reporter: Ming LI
>            Assignee: Ming LI
>            Priority: Major
>             Fix For: backlog
>
>
> I created a parquet table and inserted the following values into the table:
> {code}
> sr37228_repro=# select * from number;
>  id
> ----
>   1
>   1
>   1
>   1
>   1
> (5 rows)
> {code}
> I then modified the data in two of the three blocks and tried reading the data again.
> {code}
> Modifying contents of internal table blocks...
> Found hdfs://hdm1.hdp.local:8020/hawq_default/16385/16543/17000/10 in hdfs
> Modifying block /hadoop/hdfs/data/current/BP-2023073008-172.28.21.63-1462922052672/current/finalized/subdir0/subdir0/blk_1073742008 on 172.28.21.155
> block_script.sh                                                                                                        100%  228     0.2KB/s   00:00
> Modifying block /hadoop/hdfs/data/current/BP-2023073008-172.28.21.63-1462922052672/current/finalized/subdir0/subdir0/blk_1073742008 on 172.28.21.156
> block_script.sh                                                                                                        100%  228     0.2KB/s   00:00
> Running count query again, this time with bad data in two of the three blocks
>  count |    id
> -------+----------
>      1 |        0
>      2 |        1
>      1 | 16777216
>      1 | 16777217
> (4 rows)
> Checking Showing file health:
> Checking hdfs://hdm1.hdp.local:8020/hawq_default/16385/16543/17000/10 health
> Connecting to namenode via http://hdm1.hdp.local:50070/fsck?ugi=gpadmin&blocks=1&locations=1&files=1&path=%2Fhawq_default%2F16385%2F16543%2F17000%2F10
> FSCK started by gpadmin (auth:SIMPLE) from /172.28.21.157 for path /hawq_default/16385/16543/17000/10 at Mon Sep 26 12:07:53 PDT 2016
> /hawq_default/16385/16543/17000/10 206 bytes, 1 block(s):  OK
> 0. BP-2023073008-172.28.21.63-1462922052672:blk_1073742008_1186 len=206 repl=3 [DatanodeInfoWithStorage[172.28.21.155:50010,DS-1a18c785-48e5-4ab8-9228-b3f6857b952a,DISK], DatanodeInfoWithStorage[172.28.19.211:50010,DS-6bf49ae7-6745-448b-803d-d12d93acad1d,DISK], DatanodeInfoWithStorage[172.28.21.156:50010,DS-d22b0f7f-7065-42c4-bb66-ea361ec5e56a,DISK]]
> Status: HEALTHY
>  Total size:    206 B
>  Total dirs:    0
>  Total files:   1
>  Total symlinks:                0
>  Total blocks (validated):      1 (avg. block size 206 B)
>  Minimally replicated blocks:   1 (100.0 %)
>  Over-replicated blocks:        0 (0.0 %)
>  Under-replicated blocks:       0 (0.0 %)
>  Mis-replicated blocks:         0 (0.0 %)
>  Default replication factor:    3
>  Average block replication:     3.0
>  Corrupt blocks:                0
>  Missing replicas:              0 (0.0 %)
>  Number of data-nodes:          3
>  Number of racks:               1
> FSCK ended at Mon Sep 26 12:07:53 PDT 2016 in 0 milliseconds
> {code}
> When setupBlockReader reads a bad block using the LocalBlockReader, the reader correctly detects a bad checksum.
> {code}
> 2016-09-26 13:02:09.267021 PDT,,,p380682,th795609216,,,,0,,,seg-10000,,,,,"LOG","00000","Resource manager discovered local host IPv4 address 127.0.0.1",,,,,,,0,,"network_utils.c",210,
> 2016-09-26 13:02:09.267171 PDT,,,p380682,th795609216,,,,0,,,seg-10000,,,,,"LOG","00000","Resource manager discovered local host IPv4 address 172.28.21.155",,,,,,,0,,"network_utils.c",210,
> 2016-09-26 13:02:16.239048 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6227,con143,cmd72,seg1,,,x6227,sx1,"DEBUG1","00000","Dropping in memory mapping OidInMemHeapMapping",,,,,,"SET log_min_messages TO 'debug5'",0,,"cdbinmemheapam.c",293,
> 2016-09-26 13:02:16.239289 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6227,con143,cmd72,seg1,,,x6227,sx1,"DEBUG3","00000","CommitTransactionCommand",,,,,,"SET log_min_messages TO 'debug5'",0,,"postgres.c",3131,
> 2016-09-26 13:02:16.239435 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6227,con143,cmd72,seg1,,,x6227,sx1,"DEBUG3","00000","CommitTransaction",,,,,,"SET log_min_messages TO 'debug5'",0,,"xact.c",5103,
> 2016-09-26 13:02:16.239819 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6227,con143,cmd72,seg1,,,x6227,sx1,"DEBUG3","00000","name: unnamed; blockState:       STARTED; state: INPROGR, xid/subid/cid: 6227/1/0, nestlvl: 1, children: <>",,,,,,"SET log_min_messages TO 'debug5'",0,,"xact.c",5128,
> 2016-09-26 13:02:16.239978 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6227,con143,cmd72,seg1,,,x6227,sx1,"DEBUG1","00000","Dropping in memory mapping OidInMemOnlyMapping",,,,,,"SET log_min_messages TO 'debug5'",0,,"cdbinmemheapam.c",293,
> 2016-09-26 13:02:25.600367 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,0,con143,,seg1,,,,,"DEBUG5","00000","First char: 'M'; gp_role = 'execute'.",,,,,,,0,,"postgres.c",4737,
> 2016-09-26 13:02:25.600639 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,0,con143,cmd74,seg1,,,,,"DEBUG1","00000","Message type M received by from libpq, len = 1412",,,,,,,0,,"postgres.c",4813,
> 2016-09-26 13:02:25.600742 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,0,con143,cmd74,seg1,,,,,"DEBUG5","00000","MPP dispatched stmt from QD: explain analyze select * from number;.",,,,,,,0,,"postgres.c",4893,
> 2016-09-26 13:02:25.600847 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,0,con143,cmd74,seg1,,,,,"DEBUG1","00000","SetupProcessIdentity: receive msg: ProcessIdentity_Begin_slice_1_idx_0_gang_1_cmd_74_writer_t_End_ProcessIdentity",,,,,,,0,,"identity.c",365,
> 2016-09-26 13:02:25.600997 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,0,con143,cmd74,seg1,,,,,"DEBUG1","00000","ProcessIdentity is not init",,,,,,,0,,"identity.c",599,
> 2016-09-26 13:02:25.601129 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,0,con143,cmd74,seg1,,,,,"DEBUG1","00000","ProcessIdentity: slice 1 id 0 gang num 1 writer t",,,,,,,0,,"identity.c",602,
> 2016-09-26 13:02:25.601250 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,0,con143,cmd74,seg0,slice1,,,,"DEBUG5","00000","Get a temporary directory:/tmp/hawq/segment",,,,,,,0,,"cdbtmpdir.c",48,
> 2016-09-26 13:02:25.601351 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,0,con143,cmd74,seg0,slice1,,,,"DEBUG1","00000","getLocalTmpDirFromSegmentConfig session_id:143 command_id:74 qeidx:0 tmpdir:/tmp/hawq/segment",,,,,,,0,,"identity.c",418,
> 2016-09-26 13:02:25.601784 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,0,con143,cmd74,seg0,slice1,,,,"DEBUG3","00000","StartTransactionCommand",,,,,,"explain analyze select * from number;",0,,"postgres.c",3107,
> 2016-09-26 13:02:25.602075 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG3","00000","StartTransaction",,,,,,"explain analyze select * from number;",0,,"xact.c",5103,
> 2016-09-26 13:02:25.602195 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG3","00000","name: unnamed; blockState:       DEFAULT; state: INPROGR, xid/subid/cid: 6228/1/0, nestlvl: 1, children: <>",,,,,,"explain analyze select * from number;",0,,"xact.c",5128,
> 2016-09-26 13:02:25.602578 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","add index 0 key 17000 relation pg_attribute",,,,,,"explain analyze select * from number;",0,,"cdbinmemheapam.c",624,
> 2016-09-26 13:02:25.602703 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","add index 1 key 17000 relation pg_attribute",,,,,,"explain analyze select * from number;",0,,"cdbinmemheapam.c",624,
> 2016-09-26 13:02:25.602836 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","add index 2 key 17000 relation pg_attribute",,,,,,"explain analyze select * from number;",0,,"cdbinmemheapam.c",624,
> 2016-09-26 13:02:25.602994 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","add index 3 key 17000 relation pg_attribute",,,,,,"explain analyze select * from number;",0,,"cdbinmemheapam.c",624,
> 2016-09-26 13:02:25.603104 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","add index 4 key 17000 relation pg_attribute",,,,,,"explain analyze select * from number;",0,,"cdbinmemheapam.c",624,
> 2016-09-26 13:02:25.603211 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","add index 5 key 17000 relation pg_attribute",,,,,,"explain analyze select * from number;",0,,"cdbinmemheapam.c",624,
> 2016-09-26 13:02:25.603317 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","add index 6 key 17000 relation pg_attribute",,,,,,"explain analyze select * from number;",0,,"cdbinmemheapam.c",624,
> 2016-09-26 13:02:25.603572 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","add index 7 key 17000 relation pg_attribute",,,,,,"explain analyze select * from number;",0,,"cdbinmemheapam.c",624,
> 2016-09-26 13:02:25.603751 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","add index 8 key 17002 relation pg_attribute",,,,,,"explain analyze select * from number;",0,,"cdbinmemheapam.c",624,
> 2016-09-26 13:02:25.603881 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","add index 9 key 17002 relation pg_attribute",,,,,,"explain analyze select * from number;",0,,"cdbinmemheapam.c",624,
> 2016-09-26 13:02:25.604003 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","add index 10 key 17002 relation pg_attribute",,,,,,"explain analyze select * from number;",0,,"cdbinmemheapam.c",624,
> 2016-09-26 13:02:25.604110 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","add index 11 key 17002 relation pg_attribute",,,,,,"explain analyze select * from number;",0,,"cdbinmemheapam.c",624,
> 2016-09-26 13:02:25.604216 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","add index 12 key 17002 relation pg_attribute",,,,,,"explain analyze select * from number;",0,,"cdbinmemheapam.c",624,
> 2016-09-26 13:02:25.604323 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","add index 13 key 17002 relation pg_attribute",,,,,,"explain analyze select * from number;",0,,"cdbinmemheapam.c",624,
> 2016-09-26 13:02:25.604555 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","add index 14 key 17002 relation pg_attribute",,,,,,"explain analyze select * from number;",0,,"cdbinmemheapam.c",624,
> 2016-09-26 13:02:25.604697 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","add index 15 key 17002 relation pg_attribute",,,,,,"explain analyze select * from number;",0,,"cdbinmemheapam.c",624,
> 2016-09-26 13:02:25.604848 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","add index 16 key 17002 relation pg_attribute",,,,,,"explain analyze select * from number;",0,,"cdbinmemheapam.c",624,
> 2016-09-26 13:02:25.604959 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","add index 17 key 17002 relation pg_attribute",,,,,,"explain analyze select * from number;",0,,"cdbinmemheapam.c",624,
> 2016-09-26 13:02:25.605064 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","add index 18 key 17002 relation pg_attribute",,,,,,"explain analyze select * from number;",0,,"cdbinmemheapam.c",624,
> 2016-09-26 13:02:25.605591 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG3","00000","Resource enforcer finds cpu sub-system is disabled",,,,,,"explain analyze select * from number;",0,,"resourceenforcer.c",908,
> 2016-09-26 13:02:25.605716 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG2","00000","Current nice level of the process: 19",,,,,,"explain analyze select * from number;",0,,"postgres.c",283,
> 2016-09-26 13:02:25.605856 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG2","00000","Reniced process to level 19",,,,,,"explain analyze select * from number;",0,,"postgres.c",302,
> 2016-09-26 13:02:25.606073 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG5","00000","GetSnapshotData setting globalxmin and xmin to 6228",,,,,,"explain analyze select * from number;",0,,"procarray.c",552,
> 2016-09-26 13:02:25.606306 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","Inserted entry for query (sessionid=143, commandcnt=74)",,,,,,"explain analyze select * from number;",0,,"workfile_queryspace.c",283,
> 2016-09-26 13:02:25.606748 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","Have both IPv6 and IPv4 choices",,,,,,"explain analyze select * from number;",0,,"ic_udp.c",1291,
> 2016-09-26 13:02:25.606978 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","receive socket ai_family 10 ai_socktype 2 ai_protocol 17",,,,,,"explain analyze select * from number;",0,,"ic_udp.c",1303,
> 2016-09-26 13:02:25.607098 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","receive socket 6 ai_family 10 ai_socktype 2 ai_protocol 17",,,,,,"explain analyze select * from number;",0,,"ic_udp.c",1307,
> 2016-09-26 13:02:25.607207 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","bind addrlen 28 fam 10",,,,,,"explain analyze select * from number;",0,,"ic_udp.c",1318,
> 2016-09-26 13:02:25.607320 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","UDP-IC: xmit default buffer size 124928 bytes",,,,,,"explain analyze select * from number;",0,,"ic_udp.c",2200,
> 2016-09-26 13:02:25.607555 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","UDP-IC: xmit use buffer size 2097152 bytes",,,,,,"explain analyze select * from number;",0,,"ic_udp.c",2215,
> 2016-09-26 13:02:25.607678 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","UDP-IC: xmit default buffer size 124928 bytes",,,,,,"explain analyze select * from number;",0,,"ic_udp.c",2200,
> 2016-09-26 13:02:25.607787 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","UDP-IC: xmit use buffer size 2097152 bytes",,,,,,"explain analyze select * from number;",0,,"ic_udp.c",2215,
> 2016-09-26 13:02:25.607939 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","GetSockAddr socket ai_family 2 ai_socktype 2 ai_protocol 17 for 172.28.21.157",,,,,,"explain analyze select * from number;",0,,"ic_udp.c",3058,
> 2016-09-26 13:02:25.608052 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","We are inet6, remote is inet.  Converting to v4 mapped address.",,,,,,"explain analyze select * from number;",0,,"ic_udp.c",3137,
> 2016-09-26 13:02:25.608249 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","read index 0 key 17000 for relation pg_attribute",,,,,,"explain analyze select * from number;",0,,"cdbinmemheapam.c",499,
> 2016-09-26 13:02:25.608706 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","read index 1 key 17000 for relation pg_attribute",,,,,,"explain analyze select * from number;",0,,"cdbinmemheapam.c",499,
> 2016-09-26 13:02:25.608836 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","read index 2 key 17000 for relation pg_attribute",,,,,,"explain analyze select * from number;",0,,"cdbinmemheapam.c",499,
> 2016-09-26 13:02:25.608966 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","read index 3 key 17000 for relation pg_attribute",,,,,,"explain analyze select * from number;",0,,"cdbinmemheapam.c",499,
> 2016-09-26 13:02:25.609083 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","read index 4 key 17000 for relation pg_attribute",,,,,,"explain analyze select * from number;",0,,"cdbinmemheapam.c",499,
> 2016-09-26 13:02:25.609200 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","read index 5 key 17000 for relation pg_attribute",,,,,,"explain analyze select * from number;",0,,"cdbinmemheapam.c",499,
> 2016-09-26 13:02:25.609316 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","read index 6 key 17000 for relation pg_attribute",,,,,,"explain analyze select * from number;",0,,"cdbinmemheapam.c",499,
> 2016-09-26 13:02:25.609657 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","read index 7 key 17000 for relation pg_attribute",,,,,,"explain analyze select * from number;",0,,"cdbinmemheapam.c",499,
> 2016-09-26 13:02:25.613152 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG5","00000","Parquet metadata file footer length index: 198",,,,,,"explain analyze select * from number;",0,,"cdbparquetfooterprocessor.c",141,
> 2016-09-26 13:02:25.676719 PDT,,,p380675,th795609216,,,,0,,,seg-10000,,,,,"LOG","00000","3rd party error log:
> 2016-09-26 13:02:25.676477, p384452, th140708219193472, ERROR cannot setup block reader for Block: [block pool ID: BP-2023073008-172.28.21.63-1462922052672 block ID 1073742008_1186] file /hawq_default/16385/16543/17000/10 on Datanode: hdw2.hdp.local(172.28.21.155).
> LocalBlockReader.cpp: 127: HdfsIOException: Failed to construct LocalBlockReader for block: [block pool ID: BP-2023073008-172.28.21.63-1462922052672 block ID 1073742008_1186].
>         @       Hdfs::Internal::LocalBlockReader::LocalBlockReader(boost::shared_ptr<Hdfs::Internal::ReadShortCircuitInfo> const&, Hdfs::Internal::ExtendedBlock const&, long, bool, Hdfs::Internal::SessionConfig&, std::vector<char, std::allocator<char> >&)
>         @       Hdfs::Internal::InputStreamImpl::setupBlockReader(bool)
>         @       Hdfs::Internal::InputStreamImpl::readOneBlock(char*, int, bool)
>         @       Hdfs::Internal::InputStreamImpl::readInternal(char*, int)
>         @       Hdfs::Internal::InputStreamImpl::read(char*, int)
>         @       hdfsRead
>         @       gpfs_hdfs_read
>         @       HdfsRead
>         @       FileRead
>         @       readParquetFooter
>         @       ParquetStorageRead_OpenFile
>         @       parquet_getnext
>         @       ParquetScanNext
>         @       ExecTableScan
>         @       ExecProcNode
>         @       ExecMotion
>         @       ExecProcNode
>         @       ExecutePlan
>         @       ExecutorRun
>         @       PortalRunSelect
>         @       PortalRun
>         @       PostgresMain
>         @       BackendStartup
>         @       ServerLoop
>         @       PostmasterMain
>         @       main
>         @       __libc_start_main
>         @       Unknown
> Caused by
> LocalBlockReader.cpp: 283: HdfsIOException: LocalBlockReader failed to skip from position: 0, length: 0, block: [block pool ID: BP-2023073008-172.28.21.63-1462922052672 block ID 1073742008_1186].
>         @       Hdfs::Internal::LocalBlockReader::skip(long)
>         @       Hdfs::Internal::LocalBlockReader::LocalBlockReader(boost::shared_ptr<Hdfs::Internal::ReadShortCircuitInfo> const&, Hdfs::Internal::ExtendedBlock const&, long, bool, Hdfs::Internal::SessionConfig&, std::vector<char, std::allocator<char> >&)
>         @       Hdfs::Internal::InputStreamImpl::setupBlockReader(bool)
>         @       Hdfs::Internal::InputStreamImpl::readOneBlock(char*, int, bool)
>         @       Hdfs::Internal::InputStreamImpl::readInternal(char*, int)
>         @       Hdfs::Internal::InputStreamImpl::read(char*, int)
>         @       hdfsRead
>         @       gpfs_hdfs_read
>         @       HdfsRead
>         @       FileRead
>         @       readParquetFooter
>         @       ParquetStorageRead_OpenFile
>         @       parquet_getnext
>         @       ParquetScanNext
>         @       ExecTableScan
>         @       ExecProcNode
>         @       ExecMotion
>         @       ExecProcNode
>         @       ExecutePlan
>         @       ExecutorRun
>         @       PortalRunSelect
>         @       PortalRun
>         @       PostgresMain
>         @       BackendStartup
>         @       ServerLoop
>         @       PostmasterMain
>         @       main
>         @       __libc_start_main
>         @       Unknown
> Caused by
> LocalBlockReader.cpp: 156: ChecksumException: LocalBlockReader checksum not match for block: [block pool ID: BP-2023073008-172.28.21.63-1462922052672 block ID 1073742008_1186]
>         @       Hdfs::Internal::LocalBlockReader::readAndVerify(int)
>         @       Hdfs::Internal::LocalBlockReader::skip(long)
>         @       Hdfs::Internal::LocalBlockReader::LocalBlockReader(boost::shared_ptr<Hdfs::Internal::ReadShortCircuitInfo> const&, Hdfs::Internal::ExtendedBlock const&, long, bool, Hdfs::Internal::SessionConfig&, std::vector<char, std::allocator<char> >&)
>         @       Hdfs::Internal::InputStreamImpl::setupBlockReader(bool)
>         @       Hdfs::Internal::InputStreamImpl::readOneBlock(char*, int, bool)
>         @       Hdfs::Internal::InputStreamImpl::readInternal(char*, int)
>         @       Hdfs::Internal::InputStreamImpl::read(char*, int)
>         @       hdfsRead
>         @       gpfs_hdfs_read
>         @       HdfsRead
>         @       FileRead
>         @       readParquetFooter
>         @       ParquetStorageRead_OpenFile
>         @       parquet_getnext
>         @       ParquetScanNext
>         @       ExecTableScan
>         @       ExecProcNode
>         @       ExecMotion
>         @       ExecProcNode
>         @       ExecutePlan
>         @       ExecutorRun
>         @       PortalRunSelect
>         @       PortalRun
>         @       PostgresMain
>         @       BackendStartup
>         @       ServerLoop
>         @       PostmasterMain
>         @       main
>         @       __libc_start_main
>         @       Unknown
> retry the same node but disable read shortcircuit feature",,,,,,,,"SysLoggerMain","syslogger.c",518,
> 2016-09-26 13:02:25.680638 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 
> {code}
> Even though it correctly detected the bad checksum using the LocalBlockReader, when it calls the RemoteBlockReader it does not appear to detect the bad checksum, and the read is allowed to go through.
> {code}
> sr37228_repro=# select * from number;
>     id
> ----------
>  16777217
>  16777216
>         0
>         1
>         1
> (5 rows)
> Checking hdfs://hdm1.hdp.local:8020/hawq_default/16385/16543/17000/10 health
> Connecting to namenode via http://hdm1.hdp.local:50070/fsck?ugi=gpadmin&blocks=1&locations=1&files=1&path=%2Fhawq_default%2F16385%2F16543%2F17000%2F10
> FSCK started by gpadmin (auth:SIMPLE) from /172.28.21.157 for path /hawq_default/16385/16543/17000/10 at Mon Sep 26 12:07:53 PDT 2016
> /hawq_default/16385/16543/17000/10 206 bytes, 1 block(s):  OK
> 0. BP-2023073008-172.28.21.63-1462922052672:blk_1073742008_1186 len=206 repl=3 [DatanodeInfoWithStorage[172.28.21.155:50010,DS-1a18c785-48e5-4ab8-9228-b3f6857b952a,DISK], DatanodeInfoWithStorage[172.28.19.211:50010,DS-6bf49ae7-6745-448b-803d-d12d93acad1d,DISK], DatanodeInfoWithStorage[172.28.21.156:50010,DS-d22b0f7f-7065-42c4-bb66-ea361ec5e56a,DISK]]
> Status: HEALTHY
>  Total size:    206 B
>  Total dirs:    0
>  Total files:   1
>  Total symlinks:                0
>  Total blocks (validated):      1 (avg. block size 206 B)
>  Minimally replicated blocks:   1 (100.0 %)
>  Over-replicated blocks:        0 (0.0 %)
>  Under-replicated blocks:       0 (0.0 %)
>  Mis-replicated blocks:         0 (0.0 %)
>  Default replication factor:    3
>  Average block replication:     3.0
>  Corrupt blocks:                0
>  Missing replicas:              0 (0.0 %)
>  Number of data-nodes:          3
>  Number of racks:               1
> FSCK ended at Mon Sep 26 12:07:53 PDT 2016 in 0 milliseconds
> The filesystem under path '/hawq_default/16385/16543/17000/10' is HEALTHY
> {code}
> The behavior of InputStreamImpl::setupBlockReader appears to be to: 
> 1. Attempt to read the block locally using LocalBlockReader
> 2. If the local block read fails, attempt to read the block from the next available node using RemoteBlockReader
> 3. Continue to read all the available blocks using RemoteBlockReader until we have no more blocks to read.
> In this case, the RemoteBlockReader appears to ignore the bad checksum in the block, and returns wrong results.
> Questions:
> 1. When we detect a bad checksum on the local block, why do we not mark the block as corrupt with the NameNode?
> 2. When we read the block using RemoteBlockReader, why doesn't it detect the bad block?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)