You are viewing a plain text version of this content. The canonical link for it is here.
Posted to notifications@skywalking.apache.org by GitBox <gi...@apache.org> on 2019/03/01 11:46:08 UTC

[GitHub] wu-sheng commented on a change in pull request #39: add skywalking线上问题排查定位

wu-sheng commented on a change in pull request #39: add skywalking线上问题排查定位
URL: https://github.com/apache/incubator-skywalking-website/pull/39#discussion_r261573986
 
 

 ##########
 File path: docs/zh/blog/2019-03-01-skywalking-troubleshoot.md
 ##########
 @@ -0,0 +1,159 @@
+# 前言
+
+首先描述下问题的背景,博主有个习惯,每天上下班的时候看下skywalking的trace页面的error情况。但是某天突然发现生产环境skywalking页面没有任何数据了,页面也没有显示任何的异常,有点慌,我们线上虽然没有全面铺开对接skywalking,但是也有十多个应用。看了应用agent端日志后,其实也不用太担心,对应用毫无影响。大概情况就是这样,但是问题还是要解决,下面就开始排查skywalking不可用的问题。
+
+# 使用到的工具arthas
+
+Arthas是阿里巴巴开源的一款在线诊断java应用程序的工具,是greys工具的升级版本,深受开发者喜爱。当你遇到以下类似问题而束手无策时,Arthas可以帮助你解决:
+
+1.  这个类从哪个 jar 包加载的?为什么会报各种类相关的 Exception?
+2.  我改的代码为什么没有执行到?难道是我没 commit?分支搞错了?
+3.  遇到问题无法在线上 debug,难道只能通过加日志再重新发布吗?
+4.  线上遇到某个用户的数据处理有问题,但线上同样无法 debug,线下无法重现!
+5.  是否有一个全局视角来查看系统的运行状况?
+6.  有什么办法可以监控到JVM的实时运行状态?
+7.  Arthas采用命令行交互模式,同时提供丰富的 Tab 自动补全功能,进一步方便进行问题的定位和诊断。
+
+项目地址:https://github.com/alibaba/arthas
+
+# 先定位问题一
+
+查看skywalking-oap-server.log的日志,发现会有一条异常疯狂的在输出,异常详情如下:
+
+```
+2019-03-01 09:12:11,578 - org.apache.skywalking.oap.server.core.register.worker.RegisterPersistentWorker -3264081149 [DataCarrier.IndicatorPersistentWorker.endpoint_inventory.Consumser.0.Thread] ERROR [] - Validation Failed: 1: id is too long, must be no longer than 512 bytes but was: 684;
+org.elasticsearch.action.ActionRequestValidationException: Validation Failed: 1: id is too long, must be no longer than 512 bytes but was: 684;
+        at org.elasticsearch.action.ValidateActions.addValidationError(ValidateActions.java:26) ~[elasticsearch-6.3.2.jar:6.3.2]
+        at org.elasticsearch.action.index.IndexRequest.validate(IndexRequest.java:183) ~[elasticsearch-6.3.2.jar:6.3.2]
+        at org.elasticsearch.client.RestHighLevelClient.performRequest(RestHighLevelClient.java:515) ~[elasticsearch-rest-high-level-client-6.3.2.jar:6.3.2]
+        at org.elasticsearch.client.RestHighLevelClient.performRequestAndParseEntity(RestHighLevelClient.java:508) ~[elasticsearch-rest-high-level-client-6.3.2.jar:6.3.2]
+        at org.elasticsearch.client.RestHighLevelClient.index(RestHighLevelClient.java:348) ~[elasticsearch-rest-high-level-client-6.3.2.jar:6.3.2]
+        at org.apache.skywalking.oap.server.library.client.elasticsearch.ElasticSearchClient.forceInsert(ElasticSearchClient.java:141) ~[library-client-6.0.0-alpha.jar:6.0.0-alpha]
+        at org.apache.skywalking.oap.server.storage.plugin.elasticsearch.base.RegisterEsDAO.forceInsert(RegisterEsDAO.java:66) ~[storage-elasticsearch-plugin-6.0.0-alpha.jar:6.0.0-alpha]
+        at org.apache.skywalking.oap.server.core.register.worker.RegisterPersistentWorker.lambda$onWork$0(RegisterPersistentWorker.java:83) ~[server-core-6.0.0-alpha.jar:6.0.0-alpha]
+        at java.util.HashMap$Values.forEach(HashMap.java:981) [?:1.8.0_201]
+        at org.apache.skywalking.oap.server.core.register.worker.RegisterPersistentWorker.onWork(RegisterPersistentWorker.java:74) [server-core-6.0.0-alpha.jar:6.0.0-alpha]
+        at org.apache.skywalking.oap.server.core.register.worker.RegisterPersistentWorker.access$100(RegisterPersistentWorker.java:35) [server-core-6.0.0-alpha.jar:6.0.0-alpha]
+        at org.apache.skywalking.oap.server.core.register.worker.RegisterPersistentWorker$PersistentConsumer.consume(RegisterPersistentWorker.java:120) [server-core-6.0.0-alpha.jar:6.0.0-alpha]
+        at org.apache.skywalking.apm.commons.datacarrier.consumer.ConsumerThread.consume(ConsumerThread.java:101) [apm-datacarrier-6.0.0-alpha.jar:6.0.0-alpha]
+        at org.apache.skywalking.apm.commons.datacarrier.consumer.ConsumerThread.run(ConsumerThread.java:68) [apm-datacarrier-6.0.0-alpha.jar:6.0.0-alpha]
+2019-03-01 09:12:11,627 - org.apache.skywalking.oap.server.core.register.worker.RegisterPersistentWorker -3264081198 [DataCarrier.IndicatorPersistentWorker.endpoint_inventory.Consumser.0.Thread] ERROR [] - Validation Failed: 1: id is too long, must be no longer than 512 bytes but was: 684;
+org.elasticsearch.action.ActionRequestValidationException: Validation Failed: 1: id is too long, must be no longer than 512 bytes but was: 684;
+        at org.elasticsearch.action.ValidateActions.addValidationError(ValidateActions.java:26) ~[elasticsearch-6.3.2.jar:6.3.2]
+        at org.elasticsearch.action.index.IndexRequest.validate(IndexRequest.java:183) ~[elasticsearch-6.3.2.jar:6.3.2]
+        at org.elasticsearch.client.RestHighLevelClient.performRequest(RestHighLevelClient.java:515) ~[elasticsearch-rest-high-level-client-6.3.2.jar:6.3.2]
+        at org.elasticsearch.client.RestHighLevelClient.performRequestAndParseEntity(RestHighLevelClient.java:508) ~[elasticsearch-rest-high-level-client-6.3.2.jar:6.3.2]
+        at org.elasticsearch.client.RestHighLevelClient.index(RestHighLevelClient.java:348) ~[elasticsearch-rest-high-level-client-6.3.2.jar:6.3.2]
+        at org.apache.skywalking.oap.server.library.client.elasticsearch.ElasticSearchClient.forceInsert(ElasticSearchClient.java:141) ~[library-client-6.0.0-alpha.jar:6.0.0-alpha]
+        at org.apache.skywalking.oap.server.storage.plugin.elasticsearch.base.RegisterEsDAO.forceInsert(RegisterEsDAO.java:66) ~[storage-elasticsearch-plugin-6.0.0-alpha.jar:6.0.0-alpha]
+        at org.apache.skywalking.oap.server.core.register.worker.RegisterPersistentWorker.lambda$onWork$0(RegisterPersistentWorker.java:83) ~[server-core-6.0.0-alpha.jar:6.0.0-alpha]
+        at java.util.HashMap$Values.forEach(HashMap.java:981) [?:1.8.0_201]
+        at org.apache.skywalking.oap.server.core.register.worker.RegisterPersistentWorker.onWork(RegisterPersistentWorker.java:74) [server-core-6.0.0-alpha.jar:6.0.0-alpha]
+        at org.apache.skywalking.oap.server.core.register.worker.RegisterPersistentWorker.access$100(RegisterPersistentWorker.java:35) [server-core-6.0.0-alpha.jar:6.0.0-alpha]
+        at org.apache.skywalking.oap.server.core.register.worker.RegisterPersistentWorker$PersistentConsumer.consume(RegisterPersistentWorker.java:120) [server-core-6.0.0-alpha.jar:6.0.0-alpha]
+        at org.apache.skywalking.apm.commons.datacarrier.consumer.ConsumerThread.consume(ConsumerThread.java:101) [apm-datacarrier-6.0.0-alpha.jar:6.0.0-alpha]
+        at org.apache.skywalking.apm.commons.datacarrier.consumer.ConsumerThread.run(ConsumerThread.java:68) [apm-datacarrier-6.0.0-alpha.jar:6.0.0-alpha]
+```
+
+可以看到,上面的异常输出的时间节点,以这种频率在疯狂的刷新。通过异常message,得知到是因为skywalking在写elasticsearch时,索引的id太长了。下面是elasticsearch的源码:
+
+```java
+        if (id != null && id.getBytes(StandardCharsets.UTF_8).length > 512) {
+            validationException = addValidationError("id is too long, must be no longer than 512 bytes but was: " +
+                            id.getBytes(StandardCharsets.UTF_8).length, validationException);
+        }
+```
+
+具体可见:[elasticsearch/action/index/IndexRequest.java#L240](https://github.com/elastic/elasticsearch/blob/5.2/core/src/main/java/org/elasticsearch/action/index/IndexRequest.java#L240)
+
+## 问题一:
+
+通过日志,初步定位是哪个系统的url太长,skywalking在注册url数据时触发elasticsearch针对索引id校验的异常,而skywalking注册失败后会不断的重试,所以才有了上面日志不断刷的现象。
+
+## 问题解决:
+
+elasticsearch client在写es前通过硬编码的方式写死了索引id的长度不能超过512字节大小。也就是我们不能通过从ES侧找解决方案了。回到异常的message,只能看到提示id太长,并没有写明id具体是什么,这个异常提示其实是不合格的,博主觉得应该把id的具体内容抛出来,问题就简单了。因为异常没有明确提示,系统又比较多,不能十多个系统依次关闭重启来验证到底是哪个系统的哪个url有问题。这个时候Arthas就派上用场了,在不重启应用不开启debug模式下,查看实例中的属性对象。下面通过Arthas找到具体的url。
+
+从异常中得知,org.elasticsearch.action.index.IndexRequest这个类的validate方法触发的,这个方法是没有入参的,校验的id属性其实是对象本身的属性,那么我们使用Arthas的watch指令来看下这个实例id属性。先介绍下watch的用法:
+
+### 功能说明
+
+让你能方便的观察到指定方法的调用情况。能观察到的范围为:返回值、抛出异常、入参,通过编写 OGNL 表达式进行对应变量的查看。
+
+### 参数说明
+
+watch 的参数比较多,主要是因为它能在 4 个不同的场景观察对象
+
+| 参数名称 | 参数说明 |
+| --- | --- |
+| _class-pattern_ | 类名表达式匹配 |
+| _method-pattern_ | 方法名表达式匹配 |
+| _express_ | 观察表达式 |
+| _condition-express_ | 条件表达式 |
+| \[b\] | 在**方法调用之前**观察 |
+| \[e\] | 在**方法异常之后**观察 |
+| \[s\] | 在**方法返回之后**观察 |
+| \[f\] | 在**方法结束之后**(正常返回和异常返回)观察 |
+| \[E\] | 开启正则表达式匹配,默认为通配符匹配 |
+| \[x:\] | 指定输出结果的属性遍历深度,默认为 1 |
+
+从上面的用法说明结合异常信息,我们得到了如下的指令脚本:
+
+watch org.elasticsearch.action.index.IndexRequest validate "target"
+
+执行后,就看到了我们希望了解到的内容,如:
+
+![](https://oscimg.oschina.net/oscnet/f5009134737d16a4062223f35e862006274.jpg)
 
 Review comment:
   所有blog内容,图片,需要移动到官网目录

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services