You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@inlong.apache.org by do...@apache.org on 2021/07/28 09:00:18 UTC

[incubator-inlong-website] branch master updated: [INLONG-798] dataproxy add pic and instruction on configuration in readme (#116)

This is an automated email from the ASF dual-hosted git repository.

dockerzhang pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-inlong-website.git


The following commit(s) were added to refs/heads/master by this push:
     new 1a42972  [INLONG-798] dataproxy add pic and instruction on configuration in readme (#116)
1a42972 is described below

commit 1a4297276255b99edf003015fc90a2598c2e7361
Author: ziruipeng <zp...@connect.ust.hk>
AuthorDate: Wed Jul 28 17:00:09 2021 +0800

    [INLONG-798] dataproxy add pic and instruction on configuration in readme (#116)
    
    Co-authored-by: stingpeng <st...@tencent.com>
---
 docs/en-us/modules/dataproxy/architecture.md      | 143 ++++++++++++++++++++-
 docs/en-us/modules/dataproxy/img/architecture.png | Bin 0 -> 431999 bytes
 docs/zh-cn/modules/dataproxy/architecture.md      | 146 +++++++++++++++++++++-
 docs/zh-cn/modules/dataproxy/img/architecture.png | Bin 0 -> 431999 bytes
 4 files changed, 282 insertions(+), 7 deletions(-)

diff --git a/docs/en-us/modules/dataproxy/architecture.md b/docs/en-us/modules/dataproxy/architecture.md
index 2fe2abf..8e6d2ef 100644
--- a/docs/en-us/modules/dataproxy/architecture.md
+++ b/docs/en-us/modules/dataproxy/architecture.md
@@ -1,11 +1,150 @@
 # 1、intro
 
-    Inlong-bus belongs to the inlong proxy layer and is used for data collection, reception and forwarding. Through format conversion, the data is converted into TDMsg1 format that can be cached and processed by the cache layer
-    The overall architecture of inlong-bus is based on Apache Flume. On the basis of this project, inlong-bus expands the source layer and sink layer, and optimizes disaster tolerance forwarding, which improves the stability of the system.
+    Inlong-dataProxy belongs to the inlong proxy layer and is used for data collection, reception and forwarding. Through format conversion, the data is converted into TDMsg1 format that can be cached and processed by the cache layer
+    InLong-dataProxy acts as a bridge from the InLong collection end to the InLong buffer end. Dataproxy pulls the relationship between the business id and the corresponding topic name from the manager module, and internally manages the producers of multiple topics
+    The overall architecture of inlong-dataproxy is based on Apache Flume. On the basis of this project, inlong-bus expands the source layer and sink layer, and optimizes disaster tolerance forwarding, which improves the stability of the system.
 
 
 # 2、architecture
 
+![](img/architecture.png)
+
  	1. The source layer opens port monitoring, which is realized through netty server. The decoded data is sent to the channel layer
  	2. The channel layer has a selector, which is used to choose which type of channel to go. If the memory is eventually full, the data will be processed.
  	3. The data of the channel layer will be forwarded through the sink layer. The main purpose here is to convert the data to the TDMsg1 format and push it to the cache layer (tube is more commonly used here)
+
+# 3、DataProxy support configuration instructions
+
+DataProxy supports configurable source-channel-sink, and the configuration method is the same as the configuration file structure of flume:
+
+Source configuration example and corresponding notes:
+
+    agent1.sources.tcp-source.channels = ch-msg1 ch-msg2 ch-msg3 ch-more1 ch-more2 ch-more3 ch-msg5 ch-msg6 ch-msg7 ch-msg8 ch-msg9 ch-msg10 ch-transfer ch -Back
+    Define the channel used in the source. Note that if the configuration below this source uses the channel, it needs to be annotated here
+
+    agent1.sources.tcp-source.type = org.apache.flume.source.SimpleTcpSource
+    tcp resolution type definition, here provide the class name for instantiation, SimpleTcpSource is mainly to initialize the configuration and start port monitoring
+
+    agent1.sources.tcp-source.msg-factory-name = org.apache.flume.source.ServerMessageFactory
+    Handler used for message structure analysis, and set read stream handler and write stream handler, netty concept, please refer to: https://blog.csdn.net/u013252773/article/details/21195593
+
+
+    agent1.sources.tcp-source.host = 0.0.0.0
+    tcp ip binding monitoring, binding all network cards by default
+
+    agent1.sources.tcp-source.port = 46801
+    tcp port binding, port 46801 is bound by default
+
+    agent1.sources.tcp-source.highWaterMark=2621440
+    The concept of netty, set the netty high water level value
+
+    agent1.sources.tcp-source.enableExceptionReturn=true
+    The new function of v1.7 version, optional, the default is false, used to open the exception channel, when an exception occurs, the data is written to the exception channel to prevent other normal data transmission (the open source version does not add this function), Details: Increase the local disk of abnormal data landing
+
+    agent1.sources.tcp-source.max-msg-length = 524288
+    Limit the size of a single package, here if the compressed package is transmitted, it is the compressed package size, the limit is 512KB
+
+    agent1.sources.tcp-source.topic = test_token
+    The default topic value, if the mapping relationship between bid and topic cannot be found, it will be sent to this topic
+
+    agent1.sources.tcp-source.attr = m=9
+    The default value of m is set, where the value of m is the version of inlong's internal TdMsg protocol
+
+    agent1.sources.tcp-source.connections = 5000
+    Concurrent connections go online, new connections will be broken when the upper limit is exceeded
+
+    agent1.sources.tcp-source.max-threads = 64
+    Netty thread pool work thread upper limit, generally recommended to choose twice the cpu
+
+    agent1.sources.tcp-source.receiveBufferSize = 524288
+    Netty server tcp tuning parameters, please refer to: https://stackoverflow.com/questions/4257410/what-are-so-sndbuf-and-so-rcvbuf
+    
+    agent1.sources.tcp-source.sendBufferSize = 524288
+    Netty server tcp tuning parameters, please refer to: https://stackoverflow.com/questions/4257410/what-are-so-sndbuf-and-so-rcvbuf
+
+    agent1.sources.tcp-source.custom-cp = true
+    Whether to use the self-developed channel process, the self-developed channel process can select the alternate channel to send when the main channel is blocked
+
+    agent1.sources.tcp-source.selector.type = org.apache.flume.channel.FailoverChannelSelector
+    This channel selector is a self-developed channel selector, which is not much different from the official website, mainly because of the channel master-slave selection logic
+
+    agent1.sources.tcp-source.selector.master = ch-msg5 ch-msg6 ch-msg7 ch-msg8 ch-msg9
+    Specify the master channel, these channels will be preferentially selected for data push. Those channels that are not in the master, transfer, fileMetric, and slaMetric configuration items, but are in
+    There are defined channels in channels, which are all classified as slave channels. When the master channel is full, the slave channel will be selected. Generally, the file channel type is recommended for the slave channel.
+
+    agent1.sources.tcp-source.selector.transfer = ch-msg5 ch-msg6 ch-msg7 ch-msg8 ch-msg9
+    Specify the transfer channel to accept the transfer type data. The transfer here generally refers to the data pushed to the non-tube cluster, which is only for forwarding, and it is reserved for subsequent functions.
+
+    agent1.sources.tcp-source.selector.fileMetric = ch-back
+    Specify the fileMetric channel to receive the metric data reported by the agent
+
+
+Channel configuration examples and corresponding annotations
+
+memory channel
+
+    agent1.channels.ch-more1.type = memory
+    memory channel type
+
+    agent1.channels.ch-more1.capacity = 10000000
+    Memory channel queue size, the maximum number of messages that can be cached
+
+    agent1.channels.ch-more1.keep-alive = 0
+    
+    agent1.channels.ch-more1.transactionCapacity = 20
+    The maximum number of batches are processed in atomic operations, and the memory channel needs to be locked when used, so there will be a batch process to increase efficiency
+
+file channel
+
+    agent1.channels.ch-msg5.type = file
+    file channel type
+
+    agent1.channels.ch-msg5.capacity = 100000000
+    The maximum number of messages that can be cached in a file channel
+
+    agent1.channels.ch-msg5.maxFileSize = 1073741824
+    file channel file maximum limit, the number of bytes
+
+    agent1.channels.ch-msg5.minimumRequiredSpace = 1073741824
+    The minimum free space of the disk where the file channel is located. Setting this value can prevent the disk from being full
+
+    agent1.channels.ch-msg5.checkpointDir = /data/work/file/ch-msg5/check
+    file channel checkpoint path
+
+    agent1.channels.ch-msg5.dataDirs = /data/work/file/ch-msg5/data
+    file channel data path
+
+    agent1.channels.ch-msg5.fsyncPerTransaction = false
+    Whether to synchronize the disk for each atomic operation, it is recommended to change it to false, otherwise it will affect the performance
+
+    agent1.channels.ch-msg5.fsyncInterval = 5
+    The time interval between data flush from memory to disk, in seconds
+
+Sink configuration example and corresponding notes
+
+    agent1.sinks.meta-sink-more1.channel = ch-msg1
+    The upstream channel name of the sink
+
+    agent1.sinks.meta-sink-more1.type = org.apache.flume.sink.MetaSink
+    The sink class is implemented, where the message is implemented to push data to the tube cluster
+
+    agent1.sinks.meta-sink-more1.master-host-port-list = sk-tegspider-tube-master-1:8609,sk-tegspider-tube-master-2:8609
+    Tube cluster master node list
+
+    agent1.sinks.meta-sink-more1.send_timeout = 30000
+    Timeout limit when sending to tube
+
+    agent1.sinks.meta-sink-more1.stat-interval-sec = 60
+    Sink indicator statistics interval time, in seconds
+
+    agent1.sinks.meta-sink-more1.thread-num = 8
+    Sink class sends messages to the worker thread, 8 means to start 8 concurrent threads
+
+    agent1.sinks.meta-sink-more1.client-id-cache = true
+    agent id cache, used to check the data reported by the agent to remove duplicates
+
+    agent1.sinks.meta-sink-more1.max-survived-time = 300000
+    Maximum cache time
+    
+    agent1.sinks.meta-sink-more1.max-survived-size = 3000000
+    Maximum number of caches
diff --git a/docs/en-us/modules/dataproxy/img/architecture.png b/docs/en-us/modules/dataproxy/img/architecture.png
new file mode 100644
index 0000000..bc46026
Binary files /dev/null and b/docs/en-us/modules/dataproxy/img/architecture.png differ
diff --git a/docs/zh-cn/modules/dataproxy/architecture.md b/docs/zh-cn/modules/dataproxy/architecture.md
index c3a6901..7cf7770 100644
--- a/docs/zh-cn/modules/dataproxy/architecture.md
+++ b/docs/zh-cn/modules/dataproxy/architecture.md
@@ -1,11 +1,147 @@
 # 一、说明
 
-	inlong-bus属于inlong proxy层,用于数据的汇集接收以及转发。通过格式转换,将数据转为cache层可以缓存处理的TDMsg1格式
-	inlong-bus整体架构基于Apache Flume。inlong-bus在该项目的基础上,扩展了source层和sink层,并对容灾转发做了优化处理,提升了系统的稳定性。
-
-
+	InLong-dataProxy属于inlong proxy层,用于数据的汇集接收以及转发。通过格式转换,将数据转为cache层可以缓存处理的TDMsg1格式
+    InLong-dataProxy充当了InLong采集端到InLong缓冲端的桥梁,dataproxy从manager模块拉取业务id与对应topic名称的关系,内部管理多个topic的生产者
+	当dataproxy收到消息时,会首先缓存到本地的Channel中,并使用本地的producer往后端即cache层发送数据
+    InLong-dataProxy整体架构基于Apache Flume。inlong-dataproxy在该项目的基础上,扩展了source层和sink层,并对容灾转发做了优化处理,提升了系统的稳定性。
+    
+    
 # 二、架构
 
+![](img/architecture.png)
+
  	1.Source层开启端口监听,通过netty server实现。解码之后的数据发到channel层
  	2.channel层有一个selector,用于选择走哪种类型的channel,如果memory最终满了,会对数据做落地处理
- 	3.channel层的数据会通过sink层做转发,这里主要是将数据转为TDMsg1的格式,并推送到cache层(这里用的比较多的是tube)
\ No newline at end of file
+ 	3.channel层的数据会通过sink层做转发,这里主要是将数据转为TDMsg1的格式,并推送到cache层(这里用的比较多的是tube)
+
+
+# 三、DataProxy功能配置说明
+
+DataProxy支持配置化的source-channel-sink,配置方式与flume的配置文件结构相同:
+
+Source配置示例以及对应的注解:
+
+    agent1.sources.tcp-source.channels = ch-msg1 ch-msg2 ch-msg3 ch-more1 ch-more2 ch-more3 ch-msg5 ch-msg6 ch-msg7 ch-msg8 ch-msg9 ch-msg10 ch-transfer ch-back
+    定义source中使用到的channel,注意此source下面的配置如果有使用到channel,均需要在此注释
+
+    agent1.sources.tcp-source.type = org.apache.flume.source.SimpleTcpSource
+    tcp解析类型定义,这里提供类名用于实例化,SimpleTcpSource主要是初始化配置并启动端口监听
+
+    agent1.sources.tcp-source.msg-factory-name = org.apache.flume.source.ServerMessageFactory
+    用于构造消息解析的handler,并设置read stream handler和write stream handler,netty概念,具体可参考:https://blog.csdn.net/u013252773/article/details/21195593
+
+    agent1.sources.tcp-source.host = 0.0.0.0    
+    tcp ip绑定监听,默认绑定所有网卡
+
+    agent1.sources.tcp-source.port = 46801
+    tcp 端口绑定,默认绑定46801端口
+
+    agent1.sources.tcp-source.highWaterMark=2621440 
+    netty概念,设置netty高水位值,高水位详细解答可参考:https://www.jianshu.com/p/c9e9a5a11943
+
+    agent1.sources.tcp-source.max-msg-length = 524288
+    限制单个包大小,这里如果传输的是压缩包,则是压缩包大小,限制512KB
+
+    agent1.sources.tcp-source.topic = test_token
+    默认topic值,如果bid和topic的映射关系找不到,则发送到此topic中
+
+    agent1.sources.tcp-source.attr = m=9
+    默认m值设置,这里的m值是inlong内部TdMsg协议的版本
+
+    agent1.sources.tcp-source.connections = 5000
+    并发连接上线,超过上限值时会对新连接做断链处理
+
+    agent1.sources.tcp-source.max-threads = 64
+    netty线程池工作线程上限,一般推荐选择cpu的两倍
+
+    agent1.sources.tcp-source.receiveBufferSize = 524288
+    netty server tcp调优参数,具体解释可参考:https://stackoverflow.com/questions/4257410/what-are-so-sndbuf-and-so-rcvbuf
+    
+    agent1.sources.tcp-source.sendBufferSize = 524288
+    netty server tcp调优参数,具体解释可参考:https://stackoverflow.com/questions/4257410/what-are-so-sndbuf-and-so-rcvbuf
+
+    agent1.sources.tcp-source.custom-cp = true
+    是否使用自研的channel process,自研channel process可在主channel阻塞时,选择备用channel发送
+
+    agent1.sources.tcp-source.selector.type = org.apache.flume.channel.FailoverChannelSelector
+    这个channel selector就是自研的channel selector,和官网的差别不大,主要是有channel主从选择逻辑
+
+    agent1.sources.tcp-source.selector.master = ch-msg5 ch-msg6 ch-msg7 ch-msg8 ch-msg9
+    指定master channel,这些channel会被优先选择用于数据推送。那些不在master、transfer、fileMetric、slaMetric配置项里的channel,但在
+    channels里面有定义的channel,统归为slave channel,当master channel都被占满时,就会选择使用slave channel,slave channel一般建议使用file channel类型
+
+    agent1.sources.tcp-source.selector.transfer = ch-msg5 ch-msg6 ch-msg7 ch-msg8 ch-msg9
+    指定transfer channel,承接transfer类型的数据,这里的transfer一般是指推送到非tube集群的数据,仅做转发,这里预留出来供后续功能使用
+
+    agent1.sources.tcp-source.selector.fileMetric = ch-back
+    指定fileMetric channel,用于接收agent上报的指标数据
+
+Channel配置示例以及对应的注解
+
+memory channel
+
+    agent1.channels.ch-more1.type = memory
+    memory channel类型
+
+    agent1.channels.ch-more1.capacity = 10000000
+    memory channel 队列大小,可缓存最大消息条数
+
+    agent1.channels.ch-more1.keep-alive = 0
+    
+    agent1.channels.ch-more1.transactionCapacity = 20
+    原子操作时批量处理最大条数,memory channel使用时需要用到加锁,因此会有批处理流程增加效率
+
+file channel
+
+    agent1.channels.ch-msg5.type = file
+    file channel类型
+
+    agent1.channels.ch-msg5.capacity = 100000000
+    file channel最大可缓存消息条数
+
+    agent1.channels.ch-msg5.maxFileSize = 1073741824
+    file channel文件最大上限,字节数
+
+    agent1.channels.ch-msg5.minimumRequiredSpace = 1073741824
+    file channel所在磁盘最小可用空间,设置此值可以防止磁盘写满
+
+    agent1.channels.ch-msg5.checkpointDir = /data/work/file/ch-msg5/check
+    file channel checkpoint路径
+
+    agent1.channels.ch-msg5.dataDirs = /data/work/file/ch-msg5/data
+    file channel数据路径
+
+    agent1.channels.ch-msg5.fsyncPerTransaction = false
+    是否对每个原子操作做同步磁盘,建议改false,否则会对性能有影响
+
+    agent1.channels.ch-msg5.fsyncInterval = 5
+    数据从内存flush到磁盘的时间间隔,单位秒
+
+Sink配置示例以及对应的注解
+
+    agent1.sinks.meta-sink-more1.channel = ch-msg1
+    sink的上游channel名称
+
+    agent1.sinks.meta-sink-more1.type = org.apache.flume.sink.MetaSink
+    sink类实现,此处实现消息向tube集群推送数据
+
+    agent1.sinks.meta-sink-more1.master-host-port-list = sk-tegspider-tube-master-1:8609,sk-tegspider-tube-master-2:8609
+    tube集群master节点列表
+
+    agent1.sinks.meta-sink-more1.send_timeout = 30000
+    发送到tube时超时时间限制
+
+    agent1.sinks.meta-sink-more1.stat-interval-sec = 60
+    sink指标统计间隔时间,单位秒
+
+    agent1.sinks.meta-sink-more1.thread-num = 8
+    Sink类发送消息的工作线程,8表示启动8个并发线程
+
+    agent1.sinks.meta-sink-more1.client-id-cache = true
+    agent id缓存,用于检查agent上报数据去重
+
+    agent1.sinks.meta-sink-more1.max-survived-time = 300000
+    缓存最大时间
+    
+    agent1.sinks.meta-sink-more1.max-survived-size = 3000000
+    缓存最大个数
diff --git a/docs/zh-cn/modules/dataproxy/img/architecture.png b/docs/zh-cn/modules/dataproxy/img/architecture.png
new file mode 100644
index 0000000..bc46026
Binary files /dev/null and b/docs/zh-cn/modules/dataproxy/img/architecture.png differ