You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@doris.apache.org by GitBox <gi...@apache.org> on 2019/04/30 04:29:36 UTC
[GitHub] [incubator-doris] kangkaisen opened a new issue #1080: Restarting
BE lead to All Stream load Job failed
kangkaisen opened a new issue #1080: Restarting BE lead to All Stream load Job failed
URL: https://github.com/apache/incubator-doris/issues/1080
### Describe the bug
**Every time after restarting BE, the stream load job will fail**:
```
Status: Fail
Message: failed to send batch
NumberLoadedRows: 0
NumberFilteredRows: 0
LoadBytes: 18708516
LoadTimeMs: 100144
ErrorURL: null
Status: Fail
Message: failed to open tablet writer
```
### To Reproduce
Steps to reproduce the behavior:
1. running many stream load job
2. kill -9 restart all BE in one cluster
3. you should see the above error
### Additional context
Usually,this error will keep several minutes. So I have added a retry strategy and this issue not a big issue. But last night, **this error kept more than an hour**.
**The following is some related log in BE:**
```
W0429 21:47:00.231511 85103 olap_table_sink.cpp:203] failed to send batch, error=RPC call is timed out, error_text=[E1008]Reached timeout=50000ms @10.26.82.152:8060
W0429 21:47:00.272903 85103 olap_table_sink.cpp:355] close_wait node channel failed, load_id=604297B000A3DE93:3DF96F1E3A3EB389, node=10.26.82.152:8060, errmsg=failed to send batch
W0429 21:47:00.277072 85301 stream_load.cpp:296] commit transaction failed, id=604297B000A3DE93:3DF96F1E3A3EB389, errmsg=transaction commit failed
W0429 21:47:00.279239 85301 stream_load.cpp:225] handle streaming load failed, id=604297B000A3DE93:3DF96F1E3A3EB389, errmsg=transaction commit failed
W0429 21:58:34.429118 44529 olap_table_sink.cpp:116] failed to open tablet writer, error=RPC call is timed out, error_text=[E1008]Reached timeout=50000ms @10.26.82.149:8060
W0429 21:58:34.457186 44529 olap_table_sink.cpp:289] tablet open failed, load_id=FD435A82BACA5422:53D0158F40EFC49A, node=10.26.82.149:8060, errmsg=failed to open tablet writer
W0429 21:59:24.487653 44529 olap_table_sink.cpp:203] failed to send batch, error=RPC call is timed out, error_text=[E1008]Reached timeout=50000ms @10.26.82.145:8060
W0429 21:59:24.518293 44529 olap_table_sink.cpp:355] close_wait node channel failed, load_id=FD435A82BACA5422:53D0158F40EFC49A, node=10.26.82.145:8060, errmsg=failed to send batch
W0429 21:59:24.518378 44529 olap_table_sink.cpp:360] close_wait failed, load_id=FD435A82BACA5422:53D0158F40EFC49A
W0429 21:59:24.518388 44529 olap_table_sink.cpp:586] close channel failed, load_id=FD435A82BACA5422:53D0158F40EFC49A, txn_id=192482768
W0429 21:59:24.519050 44529 stream_load.cpp:543] fragment execute failed, load_id=FD435A82BACA5422:53D0158F40EFC49A, txn_id=192482768, query_id=E86270C7C4BC4259:A8E4D34ECEE0977D, errmsg=failed to send batch
W0429 21:59:24.519096 44720 stream_load.cpp:225] handle streaming load failed, id=FD435A82BACA5422:53D0158F40EFC49A, errmsg=failed to send batch
W0429 22:36:52.198593 85136 olap_table_sink.cpp:203] failed to send batch, error=RPC call is timed out, error_text=[E1008]Reached
timeout=50000ms @10.26.70.178:8060
I0429 22:36:52.198784 85136 status.cpp:54] failed to send batch
@ 0xb96ba9 doris::Status::Status()
@ 0x11dd838 doris::stream_load::NodeChannel::_wait_in_flight_packet()
@ 0x11dd903 doris::stream_load::NodeChannel::close_wait()
@ 0x11dee6f doris::stream_load::IndexChannel::close()
@ 0x11df2fe doris::stream_load::OlapTableSink::close()
@ 0xe405d5 doris::PlanFragmentExecutor::open_internal()
@ 0xe40af3 doris::PlanFragmentExecutor::open()
@ 0xde0077 doris::FragmentExecState::execute()
@ 0xde1a4c doris::FragmentMgr::exec_actual()
@ 0xde5034 boost::detail::function::void_function_obj_invoker0<>::invoke()
@ 0xda7ee8 doris::ThreadPool::work_thread()
@ 0x157ffbd thread_proxy
@ 0x7f4362940dd5 start_thread
@ 0x7f4362c52ead __clone
I0429 22:10:18.703799 81833 status.cpp:54] failed to open tablet writer
@ 0xb96ba9 doris::Status::Status()
@ 0x11dd105 doris::stream_load::NodeChannel::open_wait()
@ 0x11dd241 doris::stream_load::IndexChannel::open()
@ 0x11dd5ac doris::stream_load::OlapTableSink::open()
@ 0xe40127 doris::PlanFragmentExecutor::open_internal()
@ 0xe40af3 doris::PlanFragmentExecutor::open()
@ 0xde0077 doris::FragmentExecState::execute()
@ 0xde1a4c doris::FragmentMgr::exec_actual()
@ 0xde5034 boost::detail::function::void_function_obj_invoker0<>::invoke()
@ 0xda7ee8 doris::ThreadPool::work_thread()
@ 0x157ffbd thread_proxy
@ 0x7fc2bf49fdd5 start_thread
@ 0x7fc2bf7b1ead __clone
W0429 22:10:18.738099 81833 olap_table_sink.cpp:289] tablet open failed, load_id=5E4F37F7807C69E1:B9025008E1BE25A8, node=10.26.82.152:8060, errmsg=failed to open tablet writer
```
**My kafka to doris program also get the following error:**
```
rpc failed. backend id: 87716913
2019-04-29 21:58:34,167 ERROR [pool-2-thread-1] kafka.KafkaConsumerTask:248 : java.util.concurrent.TimeoutException: Read timeout
to gha-data-palo-query08.gh.sankuai.com/10.26.82.142:8040 after 60000 ms
java.util.concurrent.ExecutionException: java.util.concurrent.TimeoutException: Read timeout to xxxx.com/xxxx:8040 after 60000 ms
at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
at org.asynchttpclient.netty.NettyResponseFuture.get(NettyResponseFuture.java:202)
at com.meituan.doris.doris.DorisHttpClient.sendDoris(DorisHttpClient.java:37)
```
**I found there are a lot of delete log and the disk.io.util is very high after restarting BE**:
```
I0429 22:20:03.368141 81909 olap_table.cpp:716] delete pending data from tablet. [table=112515529.1907219336 transaction_id=192478116]
I0429 22:20:03.368486 81909 olap_table.cpp:716] delete pending data from tablet. [table=112515537.1907219336 transaction_id=192478116]
I0429 22:20:03.368813 81909 olap_table.cpp:716] delete pending data from tablet. [table=112515588.1907219336 transaction_id=192478116]
I0429 22:20:03.369206 81909 olap_table.cpp:716] delete pending data from tablet. [table=112515595.1907219336 transaction_id=192478116]
I0429 22:20:03.369566 81909 olap_table.cpp:716] delete pending data from tablet. [table=112515641.1907219336 transaction_id=192478116]
I0429 22:18:31.064805 81986 utils.cpp:1568] delete unused file. [file='/data2/olap/data/191/112665167/1509942097/112665167.hdr']
I0429 22:18:31.166507 81981 utils.cpp:1568] delete unused file. [file='/data8/olap/data/138/112651719/1509942097/112651719.hdr']
I0429 22:18:34.225603 81906 utils.cpp:1568] delete unused file. [file='/data11/olap/data/451/112729201/1509942097/112729201.hdr']
```
The following are some abnormal warn log:
```
I0429 22:58:02.467939 57825 olap_table.cpp:970] version_hash is same when publish version, publish success. [table=112752871.15099
42097 transaction_id=184325307]
W0429 22:58:02.467954 57825 olap_engine.cpp:961] can't publish version on tablet since data existed. [table=112752871.1509942097 t
ransaction_id=184325307 version=104]
I0429 22:58:02.469532 57825 olap_table.cpp:716] delete pending data from tablet. [table=112752871.1509942097 transaction_id=184325307]
I0429 22:58:02.469588 57825 olap_table.cpp:970] version_hash is same when publish version, publish success. [table=112752887.15099
42097 transaction_id=184325307]
W0429 22:58:02.469593 57825 olap_engine.cpp:961] can't publish version on tablet since data existed. [table=112752887.1509942097 t
ransaction_id=184325307 version=104]
I0429 22:58:02.470337 57825 olap_table.cpp:716] delete pending data from tablet. [table=112752887.1509942097 transaction_id=184325307]
```
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
With regards,
Apache Git Services
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@doris.apache.org
For additional commands, e-mail: dev-help@doris.apache.org