You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@doris.apache.org by GitBox <gi...@apache.org> on 2019/04/30 04:29:36 UTC

[GitHub] [incubator-doris] kangkaisen opened a new issue #1080: Restarting BE lead to All Stream load Job failed

kangkaisen opened a new issue #1080: Restarting BE lead to All Stream load Job failed
URL: https://github.com/apache/incubator-doris/issues/1080
 
 
   ### Describe the bug
   **Every time after restarting BE, the stream load job will fail**:
   ```
   Status: Fail
   Message: failed to send batch
   NumberLoadedRows: 0
   NumberFilteredRows: 0
   LoadBytes: 18708516
   LoadTimeMs: 100144
   ErrorURL: null
   
   Status: Fail
   Message: failed to open tablet writer
   ```
   
   ### To Reproduce
   Steps to reproduce the behavior:
   1.  running many stream load job
   2. kill -9 restart all BE in one cluster
   3. you should see the above error
   
   ### Additional context
   Usually,this error will keep several minutes. So I have added a retry strategy and this issue not a big issue. But last night, **this error kept  more than an hour**.
   
   **The following is some related log in BE:**
   ```
   W0429 21:47:00.231511 85103 olap_table_sink.cpp:203] failed to send batch, error=RPC call is timed out, error_text=[E1008]Reached timeout=50000ms @10.26.82.152:8060
   W0429 21:47:00.272903 85103 olap_table_sink.cpp:355] close_wait node channel failed, load_id=604297B000A3DE93:3DF96F1E3A3EB389, node=10.26.82.152:8060, errmsg=failed to send batch
   W0429 21:47:00.277072 85301 stream_load.cpp:296] commit transaction failed, id=604297B000A3DE93:3DF96F1E3A3EB389, errmsg=transaction commit failed
   W0429 21:47:00.279239 85301 stream_load.cpp:225] handle streaming load failed, id=604297B000A3DE93:3DF96F1E3A3EB389, errmsg=transaction commit failed
   
   
   W0429 21:58:34.429118 44529 olap_table_sink.cpp:116] failed to open tablet writer, error=RPC call is timed out, error_text=[E1008]Reached timeout=50000ms @10.26.82.149:8060
   W0429 21:58:34.457186 44529 olap_table_sink.cpp:289] tablet open failed, load_id=FD435A82BACA5422:53D0158F40EFC49A, node=10.26.82.149:8060, errmsg=failed to open tablet writer
   W0429 21:59:24.487653 44529 olap_table_sink.cpp:203] failed to send batch, error=RPC call is timed out, error_text=[E1008]Reached timeout=50000ms @10.26.82.145:8060
   W0429 21:59:24.518293 44529 olap_table_sink.cpp:355] close_wait node channel failed, load_id=FD435A82BACA5422:53D0158F40EFC49A, node=10.26.82.145:8060, errmsg=failed to send batch
   W0429 21:59:24.518378 44529 olap_table_sink.cpp:360] close_wait failed, load_id=FD435A82BACA5422:53D0158F40EFC49A
   W0429 21:59:24.518388 44529 olap_table_sink.cpp:586] close channel failed, load_id=FD435A82BACA5422:53D0158F40EFC49A, txn_id=192482768
   W0429 21:59:24.519050 44529 stream_load.cpp:543] fragment execute failed, load_id=FD435A82BACA5422:53D0158F40EFC49A, txn_id=192482768, query_id=E86270C7C4BC4259:A8E4D34ECEE0977D, errmsg=failed to send batch
   W0429 21:59:24.519096 44720 stream_load.cpp:225] handle streaming load failed, id=FD435A82BACA5422:53D0158F40EFC49A, errmsg=failed to send batch
   
   W0429 22:36:52.198593 85136 olap_table_sink.cpp:203] failed to send batch, error=RPC call is timed out, error_text=[E1008]Reached
   timeout=50000ms @10.26.70.178:8060
   I0429 22:36:52.198784 85136 status.cpp:54] failed to send batch
       @           0xb96ba9  doris::Status::Status()
       @          0x11dd838  doris::stream_load::NodeChannel::_wait_in_flight_packet()
       @          0x11dd903  doris::stream_load::NodeChannel::close_wait()
       @          0x11dee6f  doris::stream_load::IndexChannel::close()
       @          0x11df2fe  doris::stream_load::OlapTableSink::close()
       @           0xe405d5  doris::PlanFragmentExecutor::open_internal()
       @           0xe40af3  doris::PlanFragmentExecutor::open()
       @           0xde0077  doris::FragmentExecState::execute()
       @           0xde1a4c  doris::FragmentMgr::exec_actual()
       @           0xde5034  boost::detail::function::void_function_obj_invoker0<>::invoke()
       @           0xda7ee8  doris::ThreadPool::work_thread()
       @          0x157ffbd  thread_proxy
       @     0x7f4362940dd5  start_thread
       @     0x7f4362c52ead  __clone
   
   
   I0429 22:10:18.703799 81833 status.cpp:54] failed to open tablet writer
       @           0xb96ba9  doris::Status::Status()
       @          0x11dd105  doris::stream_load::NodeChannel::open_wait()
       @          0x11dd241  doris::stream_load::IndexChannel::open()
       @          0x11dd5ac  doris::stream_load::OlapTableSink::open()
       @           0xe40127  doris::PlanFragmentExecutor::open_internal()
       @           0xe40af3  doris::PlanFragmentExecutor::open()
       @           0xde0077  doris::FragmentExecState::execute()
       @           0xde1a4c  doris::FragmentMgr::exec_actual()
       @           0xde5034  boost::detail::function::void_function_obj_invoker0<>::invoke()
       @           0xda7ee8  doris::ThreadPool::work_thread()
       @          0x157ffbd  thread_proxy
       @     0x7fc2bf49fdd5  start_thread
       @     0x7fc2bf7b1ead  __clone
   W0429 22:10:18.738099 81833 olap_table_sink.cpp:289] tablet open failed, load_id=5E4F37F7807C69E1:B9025008E1BE25A8, node=10.26.82.152:8060, errmsg=failed to open tablet writer
   ```
   
   **My kafka to doris program also get the following error:**
   
   ```
   rpc failed. backend id: 87716913
   
   
   2019-04-29 21:58:34,167 ERROR [pool-2-thread-1] kafka.KafkaConsumerTask:248 : java.util.concurrent.TimeoutException: Read timeout
   to gha-data-palo-query08.gh.sankuai.com/10.26.82.142:8040 after 60000 ms
   java.util.concurrent.ExecutionException: java.util.concurrent.TimeoutException: Read timeout to xxxx.com/xxxx:8040 after 60000 ms
           at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
           at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
           at org.asynchttpclient.netty.NettyResponseFuture.get(NettyResponseFuture.java:202)
           at com.meituan.doris.doris.DorisHttpClient.sendDoris(DorisHttpClient.java:37)
   ```
   
   **I found there are a lot of delete log and the disk.io.util is very high after  restarting BE**:
   
   ```
   I0429 22:20:03.368141 81909 olap_table.cpp:716] delete pending data from tablet. [table=112515529.1907219336 transaction_id=192478116]
   I0429 22:20:03.368486 81909 olap_table.cpp:716] delete pending data from tablet. [table=112515537.1907219336 transaction_id=192478116]
   I0429 22:20:03.368813 81909 olap_table.cpp:716] delete pending data from tablet. [table=112515588.1907219336 transaction_id=192478116]
   I0429 22:20:03.369206 81909 olap_table.cpp:716] delete pending data from tablet. [table=112515595.1907219336 transaction_id=192478116]
   I0429 22:20:03.369566 81909 olap_table.cpp:716] delete pending data from tablet. [table=112515641.1907219336 transaction_id=192478116]
   
   
   I0429 22:18:31.064805 81986 utils.cpp:1568] delete unused file. [file='/data2/olap/data/191/112665167/1509942097/112665167.hdr']
   I0429 22:18:31.166507 81981 utils.cpp:1568] delete unused file. [file='/data8/olap/data/138/112651719/1509942097/112651719.hdr']
   I0429 22:18:34.225603 81906 utils.cpp:1568] delete unused file. [file='/data11/olap/data/451/112729201/1509942097/112729201.hdr']
   ```
   
   
   The following are some abnormal warn log:
   ```
   I0429 22:58:02.467939 57825 olap_table.cpp:970] version_hash is same when publish version, publish success. [table=112752871.15099
   42097 transaction_id=184325307]
   W0429 22:58:02.467954 57825 olap_engine.cpp:961] can't publish version on tablet since data existed. [table=112752871.1509942097 t
   ransaction_id=184325307 version=104]
   I0429 22:58:02.469532 57825 olap_table.cpp:716] delete pending data from tablet. [table=112752871.1509942097 transaction_id=184325307]
   I0429 22:58:02.469588 57825 olap_table.cpp:970] version_hash is same when publish version, publish success. [table=112752887.15099
   42097 transaction_id=184325307]
   W0429 22:58:02.469593 57825 olap_engine.cpp:961] can't publish version on tablet since data existed. [table=112752887.1509942097 t
   ransaction_id=184325307 version=104]
   I0429 22:58:02.470337 57825 olap_table.cpp:716] delete pending data from tablet. [table=112752887.1509942097 transaction_id=184325307]
   ```
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@doris.apache.org
For additional commands, e-mail: dev-help@doris.apache.org