Posted to commits@pulsar.apache.org by GitBox <gi...@apache.org> on 2020/08/09 14:23:38 UTC

[GitHub] [pulsar] gotopanic opened a new issue #7785: Failure of the Avro Serialization for Complex Types in the Python Client

gotopanic opened a new issue #7785:
URL: https://github.com/apache/pulsar/issues/7785


   **Describe the bug**
   The [doc example](http://pulsar.apache.org/docs/en/2.6.0/client-libraries-python/#complex-types) fails with the following error:
   
   > ValueError: <__main__.TestComplexSchema.test_serialize_complex_avro.<locals>.MySubRecord object at 0x7fabd361b7f0> (type <class '__main__.TestComplexSchema.test_serialize_complex_avro.<locals>.MySubRecord'>) do not match ['null', {'type': 'record', 'name': 'MySubRecord', 'fields': [{'name': 'x', 'type': ['null', 'int']}, {'name': 'y', 'type': ['null', 'long']}, {'name': 'z', 'type': ['null', 'string']}]}]
   
   **To Reproduce**
   Run the attached test
   
   **Expected behavior**
   The complex object should be serialized and deserialized to an equivalent object.
   
   **Additional context**
   The offending code is in the `AvroSchema` class in pulsar/pulsar-client-cpp/python/pulsar/schema/schema.py:
   ```python
   class AvroSchema(Schema):
       def __init__(self, record_cls):
           super(AvroSchema, self).__init__(record_cls, _pulsar.SchemaType.AVRO,
                                            record_cls.schema(), 'AVRO')
           self._schema = record_cls.schema()
   
       def _get_serialized_value(self, x):
           if isinstance(x, enum.Enum):
               return x.name
           else:
               return x
   
       def encode(self, obj):
           self._validate_object_type(obj)
           buffer = io.BytesIO()
           m = {k: self._get_serialized_value(v) for k, v in obj.__dict__.items()}
           fastavro.schemaless_writer(buffer, self._schema, m)
           return buffer.getvalue()
   
       def decode(self, data):
           buffer = io.BytesIO(data)
           d = fastavro.schemaless_reader(buffer, self._schema)
           return self._record_cls(**d)
   ```
   
   Tested on 2.5.2. Code didn't change in 2.6.0.
   [pulsar_test.txt](https://github.com/apache/pulsar/files/5047658/pulsar_test.txt)
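
For illustration, here is a minimal sketch of one way the nested object could be turned into plain dicts before it reaches a dict-based Avro writer. It assumes the failure comes from `encode` passing the `MySubRecord` object itself (via `obj.__dict__`) instead of a dict. The classes `Color`, `SubRecord`, `ParentRecord` and the helper `to_avro_value` are hypothetical stand-ins, not part of the Pulsar client.

```python
import enum


class Color(enum.Enum):
    RED = 1


class SubRecord:
    """Hypothetical stand-in for a nested record."""
    def __init__(self, x, y):
        self.x = x
        self.y = y


class ParentRecord:
    """Hypothetical stand-in for the top-level record."""
    def __init__(self, name, color, sub):
        self.name = name
        self.color = color
        self.sub = sub


def to_avro_value(value):
    """Recursively convert a value into plain dicts, lists and primitives."""
    if isinstance(value, enum.Enum):
        return value.name  # same enum handling as the quoted code
    if isinstance(value, (list, tuple)):
        return [to_avro_value(item) for item in value]
    if isinstance(value, dict):
        return {key: to_avro_value(item) for key, item in value.items()}
    if hasattr(value, '__dict__'):
        # Nested record object: convert its attributes too, skipping
        # private bookkeeping attributes (names starting with '_').
        return {key: to_avro_value(item)
                for key, item in vars(value).items()
                if not key.startswith('_')}
    return value


record = ParentRecord('a', Color.RED, SubRecord(1, 2))
print(to_avro_value(record))
```

The top-level dict built in `encode` would then contain only dicts, lists and primitives, which is the shape a schemaless writer can match against a record schema.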
   
   
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] HugoPelletier commented on issue #7785: Failure of the Avro Serialization for Complex Types in the Python Client

Posted by GitBox <gi...@apache.org>.
HugoPelletier commented on issue #7785:
URL: https://github.com/apache/pulsar/issues/7785#issuecomment-880169096


   @gotopanic If it helps, I'm using version 2.8.0 and we stopped using the AvroSchema. Too many issues. We ended up using the JsonSchema. I'm aware that it may not be your objective, but I ended up overriding a few methods from the Pulsar Python client code. Here is what I found:
   
   A. The validation is broken
   Both classes (Record and Field) have default keyword arguments: `required=False` (should be True by default IMO) and `default=None`.
   
   All the `validate_type` methods of those classes ignore this. If you have something like this, it will throw an exception:
   
   ```python
   class MySubRecord(Record):
       a = String() # required is False
       b = String(required=True)
   
   class MyRecord(Record):
       c = MySubRecord()
       d = String()
   
   ```
   
   I ended up overriding the Record class like this:
   ```python
   from typing import Optional
   
   from pulsar.schema import Record as PulsarRecord
   
   
   class Record(PulsarRecord):
       """
       Overwrite the Pulsar.Record class.
       The current implementation doesn't allow SubRecord to be None.
       """
   
       def __eq__(self, other: PulsarRecord) -> bool:
           """
           Pulsar is failing validating the Array type.
   
           :param other: Pulsar Record object expected
           :return: If all the sub-record are valid, return True.
           """
           for field in self._fields:
               mine = getattr(self, field)
               theirs = getattr(other, field)
               if isinstance(mine, list):
                   # Elements may be Record objects on one side and plain dicts on
                   # the other (after decoding), so compare them by attributes.
                   if not isinstance(theirs, list) or len(mine) != len(theirs):
                       return False
                   if any(getattr(a, '__dict__', a) != getattr(b, '__dict__', b)
                          for a, b in zip(mine, theirs)):
                       return False
               elif mine != theirs:
                   return False
           return True
   
       def validate_type(self, name: str, val: PulsarRecord) -> Optional[PulsarRecord]:
           """
           Overwrite the validation. The Pulsar client is not taking in consideration the default value in the
           Pulsar.Record constructor and still validate if the value match the expected type.
           Here, if the required is set to False, we ignore it and return None.
   
           :param name: name of the parameter
           :param val: value. Can be None or an instance of a class that extend the Pulsar.Record class
           :return: value
           :exception: TypeError
           """
   
           if self._required:
               return super().validate_type(name, val)
   
           return None
   ```
   
   
   
   
   





[GitHub] [pulsar] HugoPelletier commented on issue #7785: Failure of the Avro Serialization for Complex Types in the Python Client

Posted by GitBox <gi...@apache.org>.
HugoPelletier commented on issue #7785:
URL: https://github.com/apache/pulsar/issues/7785#issuecomment-881512787


   We are in the same situation.
   The serialization/deserialization doesn't work with complex schemas.
   
   With a super simple "complex" schema like this
   
   ```python
   class MySubRecord(Record):
       x = Integer()
       y = Long()
       z = String()
   
   class Example2(Record):
       a = String()
       sub = MySubRecord()
   ```
   
   I'm having the following error:
   ```text
   Invalid type '<class 'dict'>' for sub-record field 'sub'. Expected: <class 'tests.mocks.schemas.MySubRecord'>
   ```
   
   Basically, the Pulsar Python client is unable to rebuild the sub-record from the bytes format:
   
   ```text
   b'{\n "a": "1752767a-72be-4945-b1ca-8b0b565c694a",\n "sub": {\n  "_required_default": false,\n  "_default": null,\n  "_required": false,\n  "x": 1,\n  "y": 2,\n  "z": "test"\n }\n}'
   ```
   
   The problem is [here](https://github.com/apache/pulsar/blob/master/pulsar-client-cpp/python/pulsar/schema/schema.py#L92). The dump is done at the parent level.
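
As a rough sketch of what a recursive dump and rebuild could look like while keeping the JSON language-neutral (no `py/object` keys), assuming each record class can declare which fields hold nested records. The classes `SubRecord` and `Example`, the `nested` mapping and the helpers `encode` / `from_plain` are hypothetical and not part of the Pulsar client.

```python
import json


class SubRecord:
    """Hypothetical stand-in for a nested record."""
    def __init__(self, x=None, y=None, z=None):
        self.x, self.y, self.z = x, y, z


class Example:
    """Hypothetical stand-in for the parent record."""
    # Assumed convention: field name -> class used to rebuild that field.
    nested = {'sub': SubRecord}

    def __init__(self, a=None, sub=None):
        self.a = a
        self.sub = sub


def encode(obj):
    """Dump an object tree to plain JSON bytes."""
    def public_attrs(o):
        # json.dumps calls this for every object it cannot serialize itself,
        # so nested records are dumped at their own level too.
        return {k: v for k, v in vars(o).items() if not k.startswith('_')}
    return json.dumps(obj, default=public_attrs).encode('utf-8')


def from_plain(cls, data):
    """Rebuild an instance of cls from a decoded dict, recursing into nested fields."""
    nested = getattr(cls, 'nested', {})
    kwargs = {}
    for key, value in data.items():
        if key in nested and isinstance(value, dict):
            value = from_plain(nested[key], value)
        kwargs[key] = value
    return cls(**kwargs)


payload = encode(Example('some-id', SubRecord(1, 2, 'test')))
decoded = from_plain(Example, json.loads(payload))
print(payload, type(decoded.sub).__name__)
```

Because the payload carries only field names and values, consumers written in other languages can still read it. The price is that the Python side needs the field-to-class mapping to rebuild the objects.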
   
   I did a test with a package named [jsonpickle](https://jsonpickle.readthedocs.io/en/latest/) and it solved the issue. 
   The package serializes a reference to the Python object. If your codebase is only in Python, that could be a solution. In our case, we cannot use it, since the data sent to Pulsar changes to something like:
   
   ```text
   b'{"py/object": "tests.mocks.schemas.Example2", "a": "85ece176-f973-4acb-ba80-78d6bb1eeaf8", "sub": {"py/object": "tests.mocks.schemas.MySubRecord", "_required_default": false, "_default": null, "_required": false, "x": 1, "y": 2, "z": "test"}}'
   ```
   
   We have a mix of Java, C# and Python, so this is impossible for us because of the `py/object` keys.
   
   Here is the code snippet to try:
   
   ```python
   import enum
   
   import jsonpickle
   from pulsar.schema import JsonSchema
   
   
   class MyJsonSchema(JsonSchema):
       def __init__(self, record_cls):
           super().__init__(record_cls)
   
       def _get_serialized_value(self, o):
           if isinstance(o, enum.Enum):
               return o.value
           else:
               return o.__dict__
   
       def encode(self, obj):
           self._validate_object_type(obj)
           # Drop the Record bookkeeping attributes before pickling.
           # pop() with a default avoids a KeyError if encode() runs twice.
           for key in ('_default', '_required', '_required_default'):
               obj.__dict__.pop(key, None)
           return jsonpickle.encode(obj).encode('utf-8')
   
       def decode(self, data):
           return jsonpickle.decode(data)
   ```





[GitHub] [pulsar] gotopanic commented on issue #7785: Failure of the Avro Serialization for Complex Types in the Python Client

Posted by GitBox <gi...@apache.org>.
gotopanic commented on issue #7785:
URL: https://github.com/apache/pulsar/issues/7785#issuecomment-880054445


   No update from my side. I ended up not using complex types because of this issue.





[GitHub] [pulsar] gotopanic commented on issue #7785: Failure of the Avro Serialization for Complex Types in the Python Client

Posted by GitBox <gi...@apache.org>.
gotopanic commented on issue #7785:
URL: https://github.com/apache/pulsar/issues/7785#issuecomment-880198681


   Thank you for the snippets, they may prove useful at some point.
   I stuck with the AvroSchema because of the supposed performance gain over JsonSchema. In some places, I am also using the BytesSchema at the expense (or benefit!) of losing the broker-side schema validation.
   





[GitHub] [pulsar] HugoPelletier commented on issue #7785: Failure of the Avro Serialization for Complex Types in the Python Client

Posted by GitBox <gi...@apache.org>.
HugoPelletier commented on issue #7785:
URL: https://github.com/apache/pulsar/issues/7785#issuecomment-879838789


   Does anyone have an update on this?
   Code didn't change in 2.8.0 and still has the same error message.





[GitHub] [pulsar] codelipenghui closed issue #7785: Failure of the Avro Serialization for Complex Types in the Python Client

Posted by GitBox <gi...@apache.org>.
codelipenghui closed issue #7785:
URL: https://github.com/apache/pulsar/issues/7785


   

