Posted to commits@pulsar.apache.org by GitBox <gi...@apache.org> on 2023/05/08 16:40:12 UTC

[GitHub] [pulsar] shodo created a discussion: Difference between JSON schema and AVRO schema

GitHub user shodo created a discussion: Difference between JSON schema and AVRO schema

When checking this section of the doc:
https://pulsar.apache.org/docs/3.0.x/schema-understand/

It's not really clear to me what the difference is between a JSON schema and an AVRO schema.

By JSON schema I mean this specification, a JSON document that defines how a JSON payload is structured:
http://json-schema.org/understanding-json-schema/index.html

By Avro I mean this one, a JSON document that defines how an AVRO payload is structured:
https://avro.apache.org/docs/

Both schemas are written in JSON, but their specifications are quite different.
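
To make the contrast concrete, here is a minimal sketch of the same two-field payload described under each specification. The Avro string is the one from the Pulsar C++ JSON example quoted further down; the JSON Schema string and both variable names are illustrative assumptions, not taken from the Pulsar docs.

```C++
#include <string>

// Avro specification: a named "record" with an ordered list of typed fields.
const std::string avroStyleSchema =
    R"({"type":"record","name":"cpx","fields":[{"name":"re","type":"double"},{"name":"im","type":"double"}]})";

// json-schema.org specification (hypothetical equivalent): an "object" with
// "properties" and a "required" list instead of "record" and "fields".
const std::string jsonSchemaOrgStyle =
    R"({"type":"object","properties":{"re":{"type":"number"},"im":{"type":"number"}},"required":["re","im"]})";
```

Both strings are valid JSON, but neither is a valid document under the other specification.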

However, when I check the C++ examples in the Pulsar docs, I see that the AVRO example passes this string:
```C++
static const std::string exampleSchema =
    "{\"type\":\"record\",\"name\":\"Example\",\"namespace\":\"test\","
    "\"fields\":[{\"name\":\"a\",\"type\":\"int\"},{\"name\":\"b\",\"type\":\"int\"}]}";
Producer producer;
ProducerConfiguration producerConf;
producerConf.setSchema(SchemaInfo(AVRO, "Avro", exampleSchema));
```
while in the JSON example:
```C++
std::string jsonSchema = R"({"type":"record","name":"cpx","fields":[{"name":"re","type":"double"},{"name":"im","type":"double"}]})";
SchemaInfo schemaInfo = SchemaInfo(JSON, "JSON", jsonSchema);
```

Although the two strings are constructed differently, the schemas are written the same way, and both seem to follow the AVRO specification!

So what's the point of saying that both JSON and AVRO are supported if it seems that the AVRO specification is used in both cases? Am I missing something?

GitHub link: https://github.com/apache/pulsar/discussions/20260

----
This is an automatically sent email for commits@pulsar.apache.org.
To unsubscribe, please send an email to: commits-unsubscribe@pulsar.apache.org


[GitHub] [pulsar] shadygrove added a comment to the discussion: Difference between JSON schema and AVRO schema

Posted by "shadygrove (via GitHub)" <gi...@apache.org>.
GitHub user shadygrove added a comment to the discussion: Difference between JSON schema and AVRO schema

Yes, that is correct.  At least with the Go Client that has been my experience.  The **schema** checking happens at the server, but from what I can tell the **data** checking happens in the client library... so make sure to verify for the language/version you are using.

GitHub link: https://github.com/apache/pulsar/discussions/20260#discussioncomment-6091040



[GitHub] [pulsar] shodo edited a comment on the discussion: Difference between JSON schema and AVRO schema

Posted by "shodo (via GitHub)" <gi...@apache.org>.
GitHub user shodo edited a comment on the discussion: Difference between JSON schema and AVRO schema

Thanks for sharing, @shadygrove!
So to recap: if you use the Avro schema type (not JSON), you also get a kind of "runtime" validation, because the payload is serialized directly to binary Avro following the given schema. Is that right?
I mean it's not just a check of the "contract". In our company we experienced the same issue: once the schema is validated, we can still send wrong data on the bus.

GitHub link: https://github.com/apache/pulsar/discussions/20260#discussioncomment-6086905



[GitHub] [pulsar] shodo edited a discussion: Difference between JSON schema and AVRO schema

Posted by GitBox <gi...@apache.org>.
GitHub user shodo edited a discussion: Difference between JSON schema and AVRO schema

When checking this section of the doc:
https://pulsar.apache.org/docs/3.0.x/schema-understand/

It's not really clear to me what the difference is between a JSON schema and an AVRO schema.

By JSON schema I mean this specification, a JSON document that defines how a JSON payload is structured:
http://json-schema.org/understanding-json-schema/index.html

By Avro I mean this one, a JSON document that defines how an AVRO payload is structured:
https://avro.apache.org/docs/

Both schemas are written in JSON, but their specifications are quite different.

However, when I check the C++ examples in the Pulsar docs, I see that the AVRO example passes this string:
```C++
static const std::string exampleSchema =
    "{\"type\":\"record\",\"name\":\"Example\",\"namespace\":\"test\","
    "\"fields\":[{\"name\":\"a\",\"type\":\"int\"},{\"name\":\"b\",\"type\":\"int\"}]}";
Producer producer;
ProducerConfiguration producerConf;
producerConf.setSchema(SchemaInfo(AVRO, "Avro", exampleSchema));
```
while in the JSON example:
```C++
std::string jsonSchema = R"({"type":"record","name":"cpx","fields":[{"name":"re","type":"double"},{"name":"im","type":"double"}]})";
SchemaInfo schemaInfo = SchemaInfo(JSON, "JSON", jsonSchema);
```

Although the two strings are constructed differently, their content is quite similar, and both seem to follow the AVRO specification! The only real difference is that the AVRO one has a "namespace" field, which is not present in the JSON example.

So what's the point of saying that both JSON and AVRO are supported if it seems that the AVRO specification is used in both cases? Am I missing something?

GitHub link: https://github.com/apache/pulsar/discussions/20260



[GitHub] [pulsar] shadygrove added a comment to the discussion: Difference between JSON schema and AVRO schema

Posted by "shadygrove (via GitHub)" <gi...@apache.org>.
GitHub user shadygrove added a comment to the discussion: Difference between JSON schema and AVRO schema

I have also been trying to understand this, and I find the documentation confusing.

I am using the Golang Pulsar client and could not figure out why the `NewJSONSchema()` function would fail with a valid JSON Schema. Then I found in the docs that the example for this actually uses AVRO JSON, and that the Go client's source code uses `goavro` to validate the schema and serialize data, which is bound to fail when given a JSON Schema document.

My only conclusion is that I have no idea what the intent of JSON Schema in Pulsar actually is. What am I missing? Have I misunderstood the role and purpose of the JSON schema type?

GitHub link: https://github.com/apache/pulsar/discussions/20260#discussioncomment-6052509



[GitHub] [pulsar] shadygrove added a comment to the discussion: Difference between JSON schema and AVRO schema

Posted by "shadygrove (via GitHub)" <gi...@apache.org>.
GitHub user shadygrove added a comment to the discussion: Difference between JSON schema and AVRO schema

Digging a little further, it seems the difference is in how data is stored in Pulsar. The Avro schema type stores the data in Avro binary format. The JSON schema type is STILL described by an Avro schema, but the data is **stored as Avro JSON data instead of binary format**. This can be useful for human readability, etc., but it is not technically JSON Schema. It is still Avro.
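
As a rough illustration of that storage difference, the sketch below encodes the `Example` record from the docs (two `int` fields, `a` and `b`) once with Avro's binary encoding rules (zigzag plus variable-length integers, field values concatenated in schema order) and once as JSON text. It uses only the standard library, not the Pulsar client; the function names are hypothetical, and the exact bytes Pulsar writes for its JSON schema type are an assumption here, not something confirmed in this thread.

```C++
#include <cstdint>
#include <string>
#include <vector>

// Avro binary encoding of an int: zigzag-map the value so small magnitudes
// stay small, then emit it 7 bits per byte, least significant group first,
// with the high bit set on every byte except the last.
std::vector<uint8_t> encodeAvroInt(int32_t value) {
    uint32_t zigzag = (static_cast<uint32_t>(value) << 1) ^
                      static_cast<uint32_t>(value >> 31);
    std::vector<uint8_t> out;
    while (zigzag >= 0x80) {
        out.push_back(static_cast<uint8_t>((zigzag & 0x7F) | 0x80));
        zigzag >>= 7;
    }
    out.push_back(static_cast<uint8_t>(zigzag));
    return out;
}

// Avro binary record: field values in schema order, with no field names
// and no delimiters. A reader needs the schema to decode it.
std::vector<uint8_t> encodeExampleAvro(int32_t a, int32_t b) {
    std::vector<uint8_t> out = encodeAvroInt(a);
    std::vector<uint8_t> bBytes = encodeAvroInt(b);
    out.insert(out.end(), bBytes.begin(), bBytes.end());
    return out;
}

// JSON text for the same record: self-describing and human readable.
std::string encodeExampleJson(int32_t a, int32_t b) {
    return "{\"a\":" + std::to_string(a) + ",\"b\":" + std::to_string(b) + "}";
}
```

For `a = 1, b = 2`, the binary form is the two bytes `0x02 0x04`, while the JSON form is the 13-character string `{"a":1,"b":2}`. Because the binary form can only be produced by walking the schema field by field, serializing it doubles as a structural check on the data.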

Also, when I use the JSON Schema type with Go Client library the validation check is simply doing an unmarshal of the data and if there are no errors it passes validation... it does not actually validate against the schema I pushed to the topic.  So I don't get data validation and can still push bad data to the topic under this scenario.

Not sure if this is a bug or a feature.  Just sharing my experience for anyone else out there wondering about this.

For me, Avro schema type seems to be the only way to get what I am looking for at this time... built-in validation before bad data gets to a topic.

GitHub link: https://github.com/apache/pulsar/discussions/20260#discussioncomment-6074776



[GitHub] [pulsar] tisonkun added a comment to the discussion: Difference between JSON schema and AVRO schema

Posted by GitBox <gi...@apache.org>.
GitHub user tisonkun added a comment to the discussion: Difference between JSON schema and AVRO schema

cc @codelipenghui @congbobo184 do you know the motivation and use case here?

GitHub link: https://github.com/apache/pulsar/discussions/20260#discussioncomment-5845591
