You are viewing a plain text version of this content. The canonical link for it is here.

Posted to commits@pulsar.apache.org by "BewareMyPower (via GitHub)" <gi...@apache.org> on 2023/05/22 12:12:35 UTC

[GitHub] [pulsar-client-python] BewareMyPower opened a new pull request, #119: Fetch writer schema to decode Avro messages

BewareMyPower opened a new pull request, #119:
URL: https://github.com/apache/pulsar-client-python/pull/119

   Fixes https://github.com/apache/pulsar-client-python/issues/108
   
   ### Motivation
   
   Currently the Python client uses the reader schema, which is the schema of the consumer, to decode Avro messages. However, when the writer schema is different from the reader schema, the decode will fail.
   
   ### Modifications
   
   Add `attach_client` method to `Schema` and call it when creating consumers and readers. This method stores a reference to a `_pulsar.Client` instance, which leverages the C++ APIs added in https://github.com/apache/pulsar-client-cpp/pull/257 to fetch schema info. The `AvroSchema` class fetches and caches the writer schema if it is not cached, then use both the writer schema and reader schema to decode messages.
   
   Add `test_schema_evolve` to test consumers or readers can decode any message whose writer schema is different with the reader schema.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar-client-python] shibd merged pull request #119: Fetch writer schema to decode Avro messages

Posted by "shibd (via GitHub)" <gi...@apache.org>.

shibd merged PR #119:
URL: https://github.com/apache/pulsar-client-python/pull/119


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar-client-python] BewareMyPower commented on pull request #119: Fetch writer schema to decode Avro messages

Posted by "BewareMyPower (via GitHub)" <gi...@apache.org>.

BewareMyPower commented on PR #119:
URL: https://github.com/apache/pulsar-client-python/pull/119#issuecomment-1560815081

   > Use this patch. Although flowing define will create two schemas, that's okay, right? It will use write schema of writing that message to deserialize the data.
   
   Yes, it will create two schemas. But it will cause breaking changes. If we have ways to avoid the breaking changes, maybe we don't need to make these changes. Or we can make the changes in the next release after the discussion in the mail list.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar-client-python] BewareMyPower commented on a diff in pull request #119: Fetch writer schema to decode Avro messages

Posted by "BewareMyPower (via GitHub)" <gi...@apache.org>.

BewareMyPower commented on code in PR #119:
URL: https://github.com/apache/pulsar-client-python/pull/119#discussion_r1203838596


##########
pulsar/schema/schema_avro.py:
##########
@@ -40,6 +42,8 @@ def __init__(self, record_cls, schema_definition=None):
                 self._schema = record_cls.schema()
             else:
                 self._schema = schema_definition
+            self._writer_schemas = dict()

Review Comment:
   `fastavro` uses different APIs with the Java Avro library.
   
   https://fastavro.readthedocs.io/en/latest/reader.html?highlight=schemaless_reader#fastavro._read_py.schemaless_reader
   
   - writer_schema – Schema used when calling schemaless_writer
   - reader_schema – If the schema has changed since being written then the new schema can be given to allow for schema migration
   
   The downloaded schema should be the writer schema because Python producers pass that schema to `schemaless_writer`.
   
   As for the Pulsar Java client, it seems that your understanding is wrong.
   
   https://github.com/apache/pulsar/blob/b31c5a6a325728b5dc5faebd1a33386952d733d5/pulsar-client/src/main/java/org/apache/pulsar/client/impl/schema/reader/MultiVersionAvroReader.java#L51-L52
   
   ```java
               return new AvroReader<>(parseAvroSchema(schemaInfo.getSchemaDefinition()),
                       readerSchema, pojoClassLoader, jsr310ConversionEnabled);
   ```
   
   You can see the downloaded schema is the 1st argument to construct an `AvroReader`.
   
   https://github.com/apache/pulsar/blob/b31c5a6a325728b5dc5faebd1a33386952d733d5/pulsar-client/src/main/java/org/apache/pulsar/client/impl/schema/reader/AvroReader.java#L58
   
   ```java
       public AvroReader(Schema writerSchema, Schema readerSchema, ClassLoader classLoader,
   ```
   
   As you can see, the downloaded schema is the writer schema. The `MultiVersionAvroReader#readerSchema` is from the consumer and it's the reader schema.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar-client-python] shibd commented on a diff in pull request #119: Fetch writer schema to decode Avro messages

Posted by "shibd (via GitHub)" <gi...@apache.org>.

shibd commented on code in PR #119:
URL: https://github.com/apache/pulsar-client-python/pull/119#discussion_r1203755497


##########
pulsar/schema/schema_avro.py:
##########
@@ -40,6 +42,8 @@ def __init__(self, record_cls, schema_definition=None):
                 self._schema = record_cls.schema()
             else:
                 self._schema = schema_definition
+            self._writer_schemas = dict()

Review Comment:
   This field name defines differently with the Java client.
   
   In JAVA client
   - self._schama is WriterSchema. 
   - self._writer_schemas is ReaderSchema.
   
   https://github.com/apache/pulsar/blob/b31c5a6a325728b5dc5faebd1a33386952d733d5/pulsar-client/src/main/java/org/apache/pulsar/client/impl/schema/AvroSchema.java#L57-L59
   
   It‘s better to keep consistent?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar-client-python] shibd commented on pull request #119: Fetch writer schema to decode Avro messages

Posted by "shibd (via GitHub)" <gi...@apache.org>.

shibd commented on PR #119:
URL: https://github.com/apache/pulsar-client-python/pull/119#issuecomment-1560784242

   Use this patch. Although flowing define will create two schemas, that's okay, right? It will use write schema of writing that message to deserialize the data.
   
   ```python
   class User(Record):
       name = String()
       age = Integer()
   ```
   
   ```java
       @AllArgsConstructor
       @Getter
       static class User {
           private final String name;
           private final int age;
       }
   ```
   
   Do we need to continue to solve this problem?
   https://github.com/apache/pulsar-client-python/issues/108#issuecomment-1488657932


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar-client-python] shibd commented on a diff in pull request #119: Fetch writer schema to decode Avro messages

Posted by "shibd (via GitHub)" <gi...@apache.org>.

shibd commented on code in PR #119:
URL: https://github.com/apache/pulsar-client-python/pull/119#discussion_r1204896614


##########
pulsar/schema/schema_avro.py:
##########
@@ -40,6 +42,8 @@ def __init__(self, record_cls, schema_definition=None):
                 self._schema = record_cls.schema()
             else:
                 self._schema = schema_definition
+            self._writer_schemas = dict()

Review Comment:
   I see, thanks for your explanation. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar-client-python] BewareMyPower commented on a diff in pull request #119: Fetch writer schema to decode Avro messages

Posted by "BewareMyPower (via GitHub)" <gi...@apache.org>.

BewareMyPower commented on code in PR #119:
URL: https://github.com/apache/pulsar-client-python/pull/119#discussion_r1203838596


##########
pulsar/schema/schema_avro.py:
##########
@@ -40,6 +42,8 @@ def __init__(self, record_cls, schema_definition=None):
                 self._schema = record_cls.schema()
             else:
                 self._schema = schema_definition
+            self._writer_schemas = dict()

Review Comment:
   `fastavro` uses different APIs with the Java Avro library.
   
   https://fastavro.readthedocs.io/en/latest/reader.html?highlight=schemaless_reader#fastavro._read_py.schemaless_reader
   
   - writer_schema – Schema used when calling schemaless_writer
   - reader_schema – If the schema has changed since being written then the new schema can be given to allow for schema migration
   
   The downloaded schema should be the writer schema because Python producers pass that schema to `schemaless_writer`.
   
   As for the Pulsar Java client, it seems that your understanding is wrong.
   
   https://github.com/apache/pulsar/blob/b31c5a6a325728b5dc5faebd1a33386952d733d5/pulsar-client/src/main/java/org/apache/pulsar/client/impl/schema/reader/MultiVersionAvroReader.java#L51-L52
   
   ```java
               return new AvroReader<>(parseAvroSchema(schemaInfo.getSchemaDefinition()),
                       readerSchema, pojoClassLoader, jsr310ConversionEnabled);
   ```
   
   You can see the downloaded schema is the 1st argument to construct an `AvroReader`.
   
   https://github.com/apache/pulsar/blob/b31c5a6a325728b5dc5faebd1a33386952d733d5/pulsar-client/src/main/java/org/apache/pulsar/client/impl/schema/reader/AvroReader.java#L58
   
   ```java
       public AvroReader(Schema writerSchema, Schema readerSchema, ClassLoader classLoader,
   ```
   
   As you can see, the downloaded schema is the writer schema. The `MultiVersionAvroReader#readerSchema`, which is passed from the consumer, is the reader schema.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org