Posted to issues@hawq.apache.org by "Zhanwei Wang (JIRA)" <ji...@apache.org> on 2015/11/20 04:32:10 UTC

[jira] [Commented] (HAWQ-42) Query Executor Error (core dump)

    [ https://issues.apache.org/jira/browse/HAWQ-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15015142#comment-15015142 ] 

Zhanwei Wang commented on HAWQ-42:
----------------------------------

I did some investigation and found the following.

1) The HAWQ QE core dumps when reading this file on HDFS: /hawq/hawq20151113/16385/25115/26714/402
fsck reports that the file is healthy.


{code}
hdfs fsck /hawq/hawq20151113/16385/25115/26714/402 -files -blocks -locations
Connecting to namenode via http://sfo-w35:50070/fsck?ugi=gpadmin&files=1&blocks=1&locations=1&path=%2Fhawq%2Fhawq20151113%2F16385%2F25115%2F26714%2F402
FSCK started by gpadmin (auth:SIMPLE) from /172.28.8.164 for path /hawq/hawq20151113/16385/25115/26714/402 at Fri Nov 20 02:21:49 UTC 2015
/hawq/hawq20151113/16385/25115/26714/402 401152 bytes, 1 block(s):  OK
0. BP-57962240-172.28.8.35-1447395561953:blk_1073919056_178351 len=401152 repl=3 [DatanodeInfoWithStorage[172.28.8.57:50010,DS-0f5ad53b-b6dd-48d8-bdda-8543011ce40d,DISK], DatanodeInfoWithStorage[172.28.8.164:50010,DS-9f4ef3cf-27aa-48cc-a980-40f9ff949bb0,DISK], DatanodeInfoWithStorage[172.28.8.82:50010,DS-5242b0aa-6bf5-4300-9993-e1f35e60253a,DISK]]

Status: HEALTHY
 Total size:    401152 B
 Total dirs:	0
 Total files:	1
 Total symlinks:		0
 Total blocks (validated):	1 (avg. block size 401152 B)
 Minimally replicated blocks:	1 (100.0 %)
 Over-replicated blocks:	0 (0.0 %)
 Under-replicated blocks:	0 (0.0 %)
 Mis-replicated blocks:		0 (0.0 %)
 Default replication factor:	3
 Average block replication:	3.0
 Corrupt blocks:		0
 Missing replicas:		0 (0.0 %)
 Number of data-nodes:		127
 Number of racks:		1
FSCK ended at Fri Nov 20 02:21:49 UTC 2015 in 1 milliseconds

{code}
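
Note that fsck only checks the block metadata the datanodes report to the namenode (length, replica count, locations); it never reads the replica contents, so a replica whose bytes are unreadable on disk can still show up as HEALTHY. On recent Hadoop versions (2.7+) the replica data can be checked against its checksum file directly on the datanode with hdfs debug verifyMeta. A sketch, assuming the standard blk_<id>_<genstamp>.meta layout:

{code}
# Run on the datanode holding the replica (e.g. 172.28.8.164).
# The meta file name is an assumption based on the blk_<id>_<genstamp>.meta pattern.
BLOCK=/data5/data/current/BP-57962240-172.28.8.35-1447395561953/current/finalized/subdir2/subdir180/blk_1073919056
hdfs debug verifyMeta -block $BLOCK -meta ${BLOCK}_178351.meta
{code}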

2) But the file cannot be read by the HDFS client.

{code}
[gpadmin@sfo-w164 wangzw]$ hdfs dfs -get  /hawq/hawq20151113/16385/25115/26714/402
15/11/20 03:15:51 WARN hdfs.DFSClient: Exception while reading from BP-57962240-172.28.8.35-1447395561953:blk_1073919056_178351 of /hawq/hawq20151113/16385/25115/26714/402 from DatanodeInfoWithStorage[172.28.8.164:50010,DS-9f4ef3cf-27aa-48cc-a980-40f9ff949bb0,DISK]
java.io.IOException: Input/output error
	at sun.nio.ch.FileDispatcherImpl.pread0(Native Method)
	at sun.nio.ch.FileDispatcherImpl.pread(FileDispatcherImpl.java:52)
	at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:220)
	at sun.nio.ch.IOUtil.read(IOUtil.java:192)
	at sun.nio.ch.FileChannelImpl.readInternal(FileChannelImpl.java:699)
	at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:684)
	at org.apache.hadoop.hdfs.BlockReaderLocal.fillBuffer(BlockReaderLocal.java:330)
	at org.apache.hadoop.hdfs.BlockReaderLocal.fillDataBuf(BlockReaderLocal.java:474)
	at org.apache.hadoop.hdfs.BlockReaderLocal.readWithBounceBuffer(BlockReaderLocal.java:605)
	at org.apache.hadoop.hdfs.BlockReaderLocal.read(BlockReaderLocal.java:569)
	at org.apache.hadoop.hdfs.DFSInputStream$ByteArrayStrategy.doRead(DFSInputStream.java:737)
	at org.apache.hadoop.hdfs.DFSInputStream.readBuffer(DFSInputStream.java:793)
	at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:853)
	at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:896)
	at java.io.DataInputStream.read(DataInputStream.java:100)
	at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:85)
	at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:59)
	at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:119)
	at org.apache.hadoop.fs.shell.CommandWithDestination$TargetFileSystem.writeStreamToFile(CommandWithDestination.java:466)
	at org.apache.hadoop.fs.shell.CommandWithDestination.copyStreamToTarget(CommandWithDestination.java:391)
	at org.apache.hadoop.fs.shell.CommandWithDestination.copyFileToTarget(CommandWithDestination.java:328)
	at org.apache.hadoop.fs.shell.CommandWithDestination.processPath(CommandWithDestination.java:263)
	at org.apache.hadoop.fs.shell.CommandWithDestination.processPath(CommandWithDestination.java:248)
	at org.apache.hadoop.fs.shell.Command.processPaths(Command.java:317)
	at org.apache.hadoop.fs.shell.Command.processPathArgument(Command.java:289)
	at org.apache.hadoop.fs.shell.CommandWithDestination.processPathArgument(CommandWithDestination.java:243)
	at org.apache.hadoop.fs.shell.Command.processArgument(Command.java:271)
	at org.apache.hadoop.fs.shell.Command.processArguments(Command.java:255)
	at org.apache.hadoop.fs.shell.CommandWithDestination.processArguments(CommandWithDestination.java:220)
	at org.apache.hadoop.fs.shell.Command.processRawArguments(Command.java:201)
	at org.apache.hadoop.fs.shell.Command.run(Command.java:165)
	at org.apache.hadoop.fs.FsShell.run(FsShell.java:287)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
	at org.apache.hadoop.fs.FsShell.main(FsShell.java:340)
15/11/20 03:15:51 WARN hdfs.DFSClient: DFSInputStream has been closed already

{code}
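
The replica that fails in the trace above is the one on 172.28.8.164 (the host the client runs on), read through BlockReaderLocal, i.e. a short-circuit local read. Since the block is replicated three times, the other two replicas can be cross-checked. A rough sketch, assuming passwordless ssh and that each node keeps its data directories under /data*/data:

{code}
# Compare the replicas of blk_1073919056 on the three datanodes.
# The data directory layout is an assumption; adjust the find path as needed.
for dn in 172.28.8.57 172.28.8.164 172.28.8.82; do
    echo "== $dn"
    ssh $dn 'find /data*/data -name blk_1073919056 2>/dev/null | xargs -r md5sum'
done
{code}

On the bad node the md5sum itself should fail with the same Input/output error, while the other two replicas should agree.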

3) The file consists of only one block. I tried to read the block file directly on the datanode; it cannot be read, so the on-disk replica is corrupt. (postgres_signal_arg=7 in the backtrace below is SIGBUS on Linux, so the QE crash is consistent with libhdfs3 hitting this unreadable data during a short-circuit local read while verifying checksums in HWCrc32c::update.)

{code}

[gpadmin@sfo-w164 wangzw]$ cat /data5/data/current/BP-57962240-172.28.8.35-1447395561953/current/finalized/subdir2/subdir180/blk_1073919056 > /dev/null
cat: /data5/data/current/BP-57962240-172.28.8.35-1447395561953/current/finalized/subdir2/subdir180/blk_1073919056: Input/output error

{code}
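
An Input/output error (EIO) from a plain cat means the kernel itself cannot read the file, which points at a disk-level failure (typically a bad sector) rather than anything HDFS-specific. A quick way to confirm, assuming smartmontools is installed; /dev/sdX is a placeholder for whichever device backs /data5:

{code}
# Kernel messages logged while the read was failing.
dmesg | grep -iE 'i/o error|sector'
# Drive-level error counters (replace /dev/sdX with the disk behind /data5).
sudo smartctl -a /dev/sdX | grep -iE 'reallocated|pending|uncorrectable'
{code}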



> Query Executor Error (core dump)
> --------------------------------
>
>                 Key: HAWQ-42
>                 URL: https://issues.apache.org/jira/browse/HAWQ-42
>             Project: Apache HAWQ
>          Issue Type: Bug
>          Components: libhdfs
>            Reporter: Xiang Sheng
>            Assignee: Zhanwei Wang
>            Priority: Critical
>
> Running the tpch_row_10g_nocompression_no_partition workload on a 128-node cluster, these queries (q1, q3, q4, q5, q6, q7, q8, q9, q10, q12, q14, q15, q17, q18, q19, q20, q21) failed with a query executor error and a core dump.
> {noformat}
> (gdb) bt
> #0  0x000000350b40f5db in raise () from /lib64/libpthread.so.0
> #1  0x0000000000ac77fa in SafeHandlerForSegvBusIll (processName=<value optimized out>, postgres_signal_arg=7) at elog.c:4497
> #2  <signal handler called>
> #3  0x00007f1b445690c2 in _mm_crc32_u64 (this=0x261fcd0, b=0x7f1b0d6d7000, len=512) at /opt/gcc-4.4.2/lib/gcc/x86_64-unknown-linux-gnu/4.4.2/include/smmintrin.h:716
> #4  Hdfs::Internal::HWCrc32c::update (this=0x261fcd0, b=0x7f1b0d6d7000, len=512) at /data/pulse2-agent/agents/agent1/work/LIBHDFS3-2.0-stash/rhel5_x86_64/src/common/HWCrc32c.cpp:114
> #5  0x00007f1b44549692 in Hdfs::Internal::LocalBlockReader::readAndVerify (this=0x26075a0, bufferSize=2097152) at /data/pulse2-agent/agents/agent1/work/LIBHDFS3-2.0-stash/rhel5_x86_64/src/client/LocalBlockReader.cpp:174
> #6  0x00007f1b4454996f in Hdfs::Internal::LocalBlockReader::readInternal (this=0x26075a0, buf=0x3057b20 "Pb\370\003V\246X", len=<value optimized out>)
>     at /data/pulse2-agent/agents/agent1/work/LIBHDFS3-2.0-stash/rhel5_x86_64/src/client/LocalBlockReader.cpp:227
> #7  0x00007f1b44549a13 in Hdfs::Internal::LocalBlockReader::read (this=0xffffffff, buf=0x7f1b0d6d7000 <Address 0x7f1b0d6d7000 out of bounds>, size=64)
>     at /data/pulse2-agent/agents/agent1/work/LIBHDFS3-2.0-stash/rhel5_x86_64/src/client/LocalBlockReader.cpp:240
> #8  0x00007f1b4453bc3a in Hdfs::Internal::InputStreamImpl::readOneBlock (this=0x2768f20, buf=0x3057b20 "Pb\370\003V\246X", size=65536, shouldUpdateMetadataOnFailure=<value optimized out>)
>     at /data/pulse2-agent/agents/agent1/work/LIBHDFS3-2.0-stash/rhel5_x86_64/src/client/InputStreamImpl.cpp:563
> #9  0x00007f1b4453c163 in Hdfs::Internal::InputStreamImpl::readInternal (this=0x2768f20, buf=0x3057b20 "Pb\370\003V\246X", size=65536) at /data/pulse2-agent/agents/agent1/work/LIBHDFS3-2.0-stash/rhel5_x86_64/src/client/InputStreamImpl.cpp:666
> #10 0x00007f1b4453c5bb in Hdfs::Internal::InputStreamImpl::read (this=0x2768f20, buf=0x3057b20 "Pb\370\003V\246X", size=65536) at /data/pulse2-agent/agents/agent1/work/LIBHDFS3-2.0-stash/rhel5_x86_64/src/client/InputStreamImpl.cpp:507
> #11 0x00007f1b44530e8c in hdfsRead (fs=<value optimized out>, file=<value optimized out>, buffer=0xffffffff, length=225275904) at /data/pulse2-agent/agents/agent1/work/LIBHDFS3-2.0-stash/rhel5_x86_64/src/client/Hdfs.cpp:800
> #12 0x00007f1b2138ab7d in gpfs_hdfs_read (fcinfo=<value optimized out>) at gpfshdfs.c:492
> #13 0x000000000092b48b in HdfsRead (protocol=<value optimized out>, fileSystem=<value optimized out>, file=<value optimized out>, buffer=<value optimized out>, length=<value optimized out>) at filesystem.c:533
> #14 0x000000000091c385 in HdfsFileRead (file=6, buffer=0x3057b20 "Pb\370\003V\246X", amount=65536) at fd.c:2722
> #15 FileRead (file=6, buffer=0x3057b20 "Pb\370\003V\246X", amount=65536) at fd.c:3133
> #16 0x0000000000bcc416 in BufferedReadIo (bufferedRead=0x3009f08, newMaxReadAheadLen=<value optimized out>, growBufferLen=<value optimized out>, isUseSplitLen=<value optimized out>) at cdbbufferedread.c:198
> #17 BufferedReadUseBeforeBuffer (bufferedRead=0x3009f08, newMaxReadAheadLen=<value optimized out>, growBufferLen=<value optimized out>, isUseSplitLen=<value optimized out>) at cdbbufferedread.c:317
> #18 BufferedReadGrowBuffer (bufferedRead=0x3009f08, newMaxReadAheadLen=<value optimized out>, growBufferLen=<value optimized out>, isUseSplitLen=<value optimized out>) at cdbbufferedread.c:647
> #19 0x0000000000bc6b79 in AppendOnlyStorageRead_InternalGetBuffer (storageRead=0x3009eb8, isUseSplitLen=0 '\000') at cdbappendonlystorageread.c:1223
> #20 AppendOnlyStorageRead_GetBuffer (storageRead=0x3009eb8, isUseSplitLen=0 '\000') at cdbappendonlystorageread.c:1289
> #21 0x0000000000599a1e in AppendOnlyExecutorReadBlock_GetContents (scan=0x3009d98, direction=<value optimized out>, slot=0x2fdfed8) at appendonlyam.c:628
> #22 getNextBlock (scan=0x3009d98, direction=<value optimized out>, slot=0x2fdfed8) at appendonlyam.c:1243
> #23 appendonlygettup (scan=0x3009d98, direction=<value optimized out>, slot=0x2fdfed8) at appendonlyam.c:1283
> #24 appendonly_getnext (scan=0x3009d98, direction=<value optimized out>, slot=0x2fdfed8) at appendonlyam.c:1673
> #25 0x000000000075de16 in AppendOnlyScanNext (scanState=<value optimized out>) at execAOScan.c:39
> #26 0x0000000000751f1b in ExecScan (scanState=0x2ffea70) at execScan.c:129
> #27 ExecTableScanRelation (scanState=0x2ffea70) at execScan.c:441
> #28 0x0000000000788a73 in ExecTableScan (node=0x2ffea70) at nodeTableScan.c:42
> #29 0x00000000007469dd in ExecProcNode (node=0x2ffea70) at execProcnode.c:904
> #30 0x000000000077efe6 in execMotionSender (node=0x2ffd2d0) at nodeMotion.c:348
> #31 ExecMotion (node=0x2ffd2d0) at nodeMotion.c:315
> #32 0x0000000000746b71 in ExecProcNode (node=0x2ffd2d0) at execProcnode.c:999
> #33 0x000000000073a8ac in ExecutePlan (estate=0x274bb60, planstate=<value optimized out>, operation=<value optimized out>, numberTuples=<value optimized out>, direction=<value optimized out>, dest=<value optimized out>) at execMain.c:3181
> #34 0x000000000073b1f2 in ExecutorRun (queryDesc=<value optimized out>, direction=<value optimized out>, count=<value optimized out>) at execMain.c:1166
> #35 0x0000000000976ec9 in PortalRunSelect (portal=<value optimized out>, count=0, isTopLevel=<value optimized out>, dest=<value optimized out>, altdest=<value optimized out>, completionTag=<value optimized out>) at pquery.c:1641
> #36 PortalRun (portal=<value optimized out>, count=0, isTopLevel=<value optimized out>, dest=<value optimized out>, altdest=<value optimized out>, completionTag=<value optimized out>) at pquery.c:1463
> #37 0x000000000096f488 in exec_mpp_query (argc=<value optimized out>, argv=<value optimized out>, username=<value optimized out>) at postgres.c:1378
> #38 PostgresMain (argc=<value optimized out>, argv=<value optimized out>, username=<value optimized out>) at postgres.c:4866
> #39 0x00000000008cf51b in BackendRun (port=0x260d420) at postmaster.c:5844
> #40 BackendStartup (port=0x260d420) at postmaster.c:5437
> #41 0x00000000008d4fef in ServerLoop (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:2139
> #42 PostmasterMain (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1431
> #43 0x00000000007d6aea in main (argc=9, argv=0x2609d20) at main.c:226
> (gdb) 
> {noformat}


