Posted to commits@rocketmq.apache.org by GitBox <gi...@apache.org> on 2022/12/01 11:18:40 UTC

[GitHub] [rocketmq-connect] odbozhou opened a new issue, #385: Schema Inferencing for JsonConverter

odbozhou opened a new issue, #385:
URL: https://github.com/apache/rocketmq-connect/issues/385

   
   
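   In many cases, messages sent to a RocketMQ topic carry no schema information. Most business messages are like this: developers usually describe the message format of a topic in interface documentation, the data is typically JSON, and downstream consumers parse the topic data according to that documentation and convert it into class objects.
   
   Connector implementations are generic and can be used with simple configuration. Without schema information, a sink connector may have a hard time parsing the data in a topic.
   
   However, because of the nature of JSON, we can infer a schema from the JSON data itself.
   For example, given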
   {
    "id": 10000,
    "name": "connector",
    "create_datetime": 1501834166000,
    "update_timestamp": 1501834166000
   }
   we can infer the corresponding schema
   {
     "schema": {
       "type": "struct",
       "fields": [{
         "type": "int32",
         "optional": true,
         "field": "id"
       }, {
         "type": "string",
         "optional": true,
         "field": "name"
       }, {
         "type": "int64",
         "optional": false,
         "name": "org.apache.rocketmq.connect.data.Timestamp",
         "version": 1,
         "field": "create_datetime"
       }, {
         "type": "int64",
         "optional": false,
         "name": "org.apache.rocketmq.connect.data.Timestamp",
         "version": 1,
         "field": "update_timestamp"
       }],
       "optional": false,
       "name": "user"
     },
     "payload": {
       "id": 10000,
       "name": "connector",
       "create_datetime": 1501834166000,
       "update_timestamp": 1501834166000
     }
   }
   
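   From the example above, we can conclude that
   JsonConverter can infer the corresponding schema from simple data.
   Having an inferred schema is very useful when a sink connector processes the data.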
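
The inference described above can be sketched roughly as follows. This is a minimal illustration, not the rocketmq-connect implementation: the class `SchemaInferenceSketch` and its methods are hypothetical, the payload is assumed to be already parsed into a `Map<String, Object>`, and the type names simply mirror those in the example schema. From the data alone, a value such as 1501834166000 can only be inferred as int64; recognizing it as a Timestamp logical type would need an additional rule that is not shown here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of rocketmq-connect.
final class SchemaInferenceSketch {

    // Maps one parsed JSON value to a schema type name.
    static String inferType(Object value) {
        if (value instanceof Boolean) return "boolean";
        if (value instanceof Integer) return "int32";
        if (value instanceof Long) return "int64";
        if (value instanceof Float || value instanceof Double) return "double";
        if (value instanceof String) return "string";
        if (value instanceof Map) return "struct";
        throw new IllegalArgumentException("unsupported value: " + value);
    }

    // Infers a type name for every top-level field, keeping field order.
    static Map<String, String> inferFields(Map<String, Object> payload) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : payload.entrySet()) {
            fields.put(e.getKey(), inferType(e.getValue()));
        }
        return fields;
    }
}
```

Calling `inferFields` on the example payload would map `id` to int32, `name` to string, and both timestamp fields to int64.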
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@rocketmq.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [rocketmq-connect] joeCarf commented on issue #385: Schema Inferencing for JsonConverter

Posted by GitBox <gi...@apache.org>.
joeCarf commented on issue #385:
URL: https://github.com/apache/rocketmq-connect/issues/385#issuecomment-1334913563

   Hi, which files should I focus on for now? And is there anything in the repository that I can use as a reference?
   
   > > and could you plz assign this issue to me? thanks a lot @odbozhou
   > 
   > If you encounter any problems now, you can raise them in this issue, and I can provide some targeted help
   
   




[GitHub] [rocketmq-connect] odbozhou commented on issue #385: Schema Inferencing for JsonConverter

Posted by GitBox <gi...@apache.org>.
odbozhou commented on issue #385:
URL: https://github.com/apache/rocketmq-connect/issues/385#issuecomment-1338676825

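   > After getting familiar with the JsonConverter code, my understanding is that my work should focus on this part: ![1669999190813](https://user-images.githubusercontent.com/52153761/205341732-8fae4678-cb45-4711-91fb-1b3b059f5719.jpg) When schemasEnabled() is false, the JSON should be parsed to generate a schema.
   > 
   > Is this approach on the right track? I'd appreciate your guidance when you have time, thx~ @odbozhou
   
   It should be the case where schemaEnabled is true and the ConnectRecord has no schema; that is when the data types need to be inferred from the JSON node types.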




[GitHub] [rocketmq-connect] odbozhou commented on issue #385: Schema Inferencing for JsonConverter

Posted by GitBox <gi...@apache.org>.
odbozhou commented on issue #385:
URL: https://github.com/apache/rocketmq-connect/issues/385#issuecomment-1354403153

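   > Hi, I've put together a rough design flowchart and would like to discuss it with you: ![Schema Inferencing for JsonConverter](https://user-images.githubusercontent.com/52153761/206893015-dc5e386d-a649-4175-89be-52af21250928.png)
   > 
   > Plain JSON data carries little information, so the schema can only be filled in using hand-written rules. I came up with the following rules:
   > 
   > 1. The optional field of the schema is always set to false.
   > 2. The name field of the schema defaults to "default" or null.
   > 3. When the original JSON contains multiple values, the schema type is set to "struct"; when it contains a single value, the schema type is the type of that value.
   > 4. Fields such as name and version in the schema are hard to infer; I'd welcome any suggestions you have.
   > 5. For the type of a data field, I can only go by the class of the JSONObject value. For example, in {"id": 10000} the value is instanceof Integer, so schema.type=int32. The drawback is that this is coarse: the inferred type is not guaranteed to be correct, and special field types such as struct, map, and date may be awkward to handle.
   
   Very detailed. I misunderstood part of this problem, so my earlier description may have been somewhat inaccurate. Whether schemaEnabled is true or false, type inference can be applied to the JSON data whenever the schema is missing.
   
   Also, I took a closer look at this code, and community contributors have already implemented this logic.
   See org.apache.rocketmq.connect.runtime.converter.record.json.JsonConverter#convertToConnect
   Feel free to look at which other features you are interested in.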
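
The condition described above (infer whenever the schema is missing, regardless of schemaEnabled) could be sketched as below. This is only an illustration under stated assumptions and does not reproduce `JsonConverter#convertToConnect`: the class `SchemaSourceSketch` and its members are hypothetical, the message is assumed to be parsed into a `Map<String, Object>`, and a declared schema is assumed to travel in the "schema"/"payload" envelope shown in the issue description.

```java
import java.util.Map;

// Hypothetical helper for illustration; not part of rocketmq-connect.
final class SchemaSourceSketch {

    enum SchemaSource { DECLARED, INFERRED }

    // True when the message looks like {"schema": ..., "payload": ...}.
    static boolean isEnvelope(Map<String, Object> message) {
        return message.size() == 2
                && message.containsKey("schema")
                && message.containsKey("payload");
    }

    // Uses the declared schema when one is present; otherwise falls back
    // to inference. The schemaEnabled setting plays no part in the choice.
    static SchemaSource chooseSchemaSource(Map<String, Object> message) {
        if (isEnvelope(message) && message.get("schema") != null) {
            return SchemaSource.DECLARED;
        }
        return SchemaSource.INFERRED;
    }
}
```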




[GitHub] [rocketmq-connect] odbozhou commented on issue #385: Schema Inferencing for JsonConverter

Posted by GitBox <gi...@apache.org>.
odbozhou commented on issue #385:
URL: https://github.com/apache/rocketmq-connect/issues/385#issuecomment-1334908091

   > and could you plz assign this issue to me? thanks a lot @odbozhou
   
   If you encounter any problems now, you can raise them in this issue, and I can provide some targeted help




[GitHub] [rocketmq-connect] sunxiaojian closed issue #385: Schema Inferencing for JsonConverter

Posted by "sunxiaojian (via GitHub)" <gi...@apache.org>.
sunxiaojian closed issue #385: Schema Inferencing for JsonConverter
URL: https://github.com/apache/rocketmq-connect/issues/385




[GitHub] [rocketmq-connect] joeCarf commented on issue #385: Schema Inferencing for JsonConverter

Posted by GitBox <gi...@apache.org>.
joeCarf commented on issue #385:
URL: https://github.com/apache/rocketmq-connect/issues/385#issuecomment-1345489032

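   Hi, I've put together a rough design flowchart and would like to discuss it with you:
   ![Schema Inferencing for JsonConverter](https://user-images.githubusercontent.com/52153761/206893015-dc5e386d-a649-4175-89be-52af21250928.png)
   
   Plain JSON data carries little information, so the schema can only be filled in using hand-written rules. I came up with the following rules:
   1. The optional field of the schema is always set to false.
   2. The name field of the schema defaults to "default" or null.
   3. When the original JSON contains multiple values, the schema type is set to "struct"; when it contains a single value, the schema type is the type of that value.
   4. Fields such as name and version in the schema are hard to infer; I'd welcome any suggestions you have.
   5. For the type of a data field, I can only go by the class of the JSONObject value. For example, in {"id": 10000} the value is instanceof Integer, so schema.type=int32. The drawback is that this is coarse: the inferred type is not guaranteed to be correct, and special field types such as struct, map, and date may be awkward to handle.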
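
Rules 1, 2, 3 and 5 above could be sketched as a small recursive function. This is a hedged illustration only: the class `RuleBasedSchemaSketch` is hypothetical, values are assumed to be already parsed into plain Java objects (`Map`, `Integer`, `Long`, and so on), the schema is represented as a nested `Map`, and the type names are borrowed from the example in the issue description. Rule 4 and the special types mentioned in rule 5 (map, date) are left out.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch for illustration; not the rocketmq-connect code.
final class RuleBasedSchemaSketch {

    static Map<String, Object> infer(Object value) {
        Map<String, Object> schema = new LinkedHashMap<>();
        if (value instanceof Map) {
            // Rule 3: a JSON object becomes a struct with one entry per field.
            List<Map<String, Object>> fields = new ArrayList<>();
            for (Map.Entry<?, ?> e : ((Map<?, ?>) value).entrySet()) {
                Map<String, Object> field = infer(e.getValue());
                field.put("field", String.valueOf(e.getKey()));
                fields.add(field);
            }
            schema.put("type", "struct");
            schema.put("fields", fields);
            schema.put("name", "default"); // Rule 2: placeholder name
        } else if (value instanceof Integer) {
            schema.put("type", "int32");   // Rule 5: type from the value's class
        } else if (value instanceof Long) {
            schema.put("type", "int64");
        } else if (value instanceof Double) {
            schema.put("type", "double");
        } else if (value instanceof Boolean) {
            schema.put("type", "boolean");
        } else if (value instanceof String) {
            schema.put("type", "string");
        } else {
            throw new IllegalArgumentException("cannot infer a type for: " + value);
        }
        schema.put("optional", false);     // Rule 1: never optional
        return schema;
    }
}
```

Because nested objects are handled by the same recursion, a nested JSON object would come out as a nested struct, which covers the simplest of the special cases raised in rule 5.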
   
   




[GitHub] [rocketmq-connect] joeCarf commented on issue #385: Schema Inferencing for JsonConverter

Posted by GitBox <gi...@apache.org>.
joeCarf commented on issue #385:
URL: https://github.com/apache/rocketmq-connect/issues/385#issuecomment-1333702527

   > In many cases, messages sent to a RocketMQ topic carry no schema information. Most business messages are like this: developers usually describe the message format of a topic in interface documentation, the data is typically JSON, and downstream consumers parse the topic data according to that documentation and convert it into class objects.
   > 
   > Connector implementations are generic and can be used with simple configuration. Without schema information, a sink connector may have a hard time parsing the data in a topic.
   > 
   > However, because of the nature of JSON, we can infer a schema from the JSON data itself. For example, given { "id": 10000, "name": "connector", "create_datetime": 1501834166000, "update_timestamp": 1501834166000 } we can infer the corresponding schema { "schema": { "type": "struct", "fields": [{ "type": "int32", "optional": true, "field": "id" }, { "type": "string", "optional": true, "field": "name" }, { "type": "int64", "optional": false, "name": "org.apache.rocketmq.connect.data.Timestamp", "version": 1, "field": "create_datetime" }, { "type": "int64", "optional": false, "name": "org.apache.rocketmq.connect.data.Timestamp", "version": 1, "field": "update_timestamp" }], "optional": false, "name": "user" }, "payload": { "id": 10000, "name": "connector", "create_datetime": 1501834166000, "update_timestamp": 1501834166000 } }
   > 
   > From the example above, we can conclude that JsonConverter can infer the corresponding schema from simple data. Having an inferred schema is very useful when a sink connector processes the data.
   
   Hi, I'd like to do it, but I may need some help. Where should I start?




[GitHub] [rocketmq-connect] joeCarf commented on issue #385: Schema Inferencing for JsonConverter

Posted by GitBox <gi...@apache.org>.
joeCarf commented on issue #385:
URL: https://github.com/apache/rocketmq-connect/issues/385#issuecomment-1333873642

   and could you plz assign this issue to me? thanks a lot @odbozhou 




[GitHub] [rocketmq-connect] joeCarf commented on issue #385: Schema Inferencing for JsonConverter

Posted by GitBox <gi...@apache.org>.
joeCarf commented on issue #385:
URL: https://github.com/apache/rocketmq-connect/issues/385#issuecomment-1345489234

   > 
   
   I'd appreciate any suggestions you can give~ @odbozhou




[GitHub] [rocketmq-connect] joeCarf commented on issue #385: Schema Inferencing for JsonConverter

Posted by GitBox <gi...@apache.org>.
joeCarf commented on issue #385:
URL: https://github.com/apache/rocketmq-connect/issues/385#issuecomment-1333697722

   I'd like to take this task. Which part of the code should I start with?




[GitHub] [rocketmq-connect] joeCarf commented on issue #385: Schema Inferencing for JsonConverter

Posted by GitBox <gi...@apache.org>.
joeCarf commented on issue #385:
URL: https://github.com/apache/rocketmq-connect/issues/385#issuecomment-1335518987

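   After getting familiar with the JsonConverter code, my understanding is that my work should focus on this part:
   ![1669999190813](https://user-images.githubusercontent.com/52153761/205341732-8fae4678-cb45-4711-91fb-1b3b059f5719.jpg)
   When schemasEnabled() is false, the JSON should be parsed to generate a schema.
   
   Is this approach on the right track? I'd appreciate your guidance when you have time, thx~ @odbozhou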

