Posted to commits@pulsar.apache.org by GitBox <gi...@apache.org> on 2020/08/09 14:23:38 UTC
[GitHub] [pulsar] gotopanic opened a new issue #7785: Failure of the Avro Serialization for Complex Types in the Python Client
gotopanic opened a new issue #7785:
URL: https://github.com/apache/pulsar/issues/7785
**Describe the bug**
The [doc example](http://pulsar.apache.org/docs/en/2.6.0/client-libraries-python/#complex-types) fails with the following error:
> ValueError: <__main__.TestComplexSchema.test_serialize_complex_avro.<locals>.MySubRecord object at 0x7fabd361b7f0> (type <class '__main__.TestComplexSchema.test_serialize_complex_avro.<locals>.MySubRecord'>) do not match ['null', {'type': 'record', 'name': 'MySubRecord', 'fields': [{'name': 'x', 'type': ['null', 'int']}, {'name': 'y', 'type': ['null', 'long']}, {'name': 'z', 'type': ['null', 'string']}]}]
**To Reproduce**
Run the attached test
**Expected behavior**
The complex object should be serialized and deserialized to an equivalent object.
**Additional context**
The offending code is in the `AvroSchema` class in pulsar/pulsar-client-cpp/python/pulsar/schema/schema.py:
```python
class AvroSchema(Schema):
    def __init__(self, record_cls):
        super(AvroSchema, self).__init__(record_cls, _pulsar.SchemaType.AVRO,
                                         record_cls.schema(), 'AVRO')
        self._schema = record_cls.schema()

    def _get_serialized_value(self, x):
        if isinstance(x, enum.Enum):
            return x.name
        else:
            return x

    def encode(self, obj):
        self._validate_object_type(obj)
        buffer = io.BytesIO()
        m = {k: self._get_serialized_value(v) for k, v in obj.__dict__.items()}
        fastavro.schemaless_writer(buffer, self._schema, m)
        return buffer.getvalue()

    def decode(self, data):
        buffer = io.BytesIO(data)
        d = fastavro.schemaless_reader(buffer, self._schema)
        return self._record_cls(**d)
```
Tested on 2.5.2. Code didn't change in 2.6.0.
[pulsar_test.txt](https://github.com/apache/pulsar/files/5047658/pulsar_test.txt)
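For illustration only, here is a minimal sketch of the kind of conversion `encode` appears to be missing: it passes the top-level `__dict__` to fastavro as-is, so a nested record arrives as an object rather than as a dict. The helper `to_plain` and the stand-in classes below are hypothetical (plain Python classes, not `pulsar.schema` records), and skipping attributes that start with `_` is an assumption about which attributes are internal. This is not the fix that was applied upstream.

```python
import enum


def to_plain(value):
    """Recursively turn record-like objects into dicts, lists and primitives."""
    if isinstance(value, enum.Enum):
        return value.name  # mirrors the Enum handling in _get_serialized_value
    if isinstance(value, dict):
        return {key: to_plain(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_plain(item) for item in value]
    if hasattr(value, "__dict__"):
        # Nested record: descend into its attributes, skipping private ones.
        return {
            key: to_plain(item)
            for key, item in vars(value).items()
            if not key.startswith("_")
        }
    return value


# Hypothetical stand-ins for Record subclasses (no pulsar import needed).
class Color(enum.Enum):
    RED = 1


class MySubRecord:
    def __init__(self, x, y, z):
        self.x, self.y, self.z = x, y, z


class MyRecord:
    def __init__(self, name, color, sub):
        self.name, self.color, self.sub = name, color, sub


record = MyRecord("a", Color.RED, MySubRecord(1, 2, "test"))
plain = to_plain(record)
print(plain)
```

The resulting nested dict is the shape a writer such as `fastavro.schemaless_writer` would need for a record field; whether the real `Record` attributes map onto it this cleanly is untested here.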
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [pulsar] HugoPelletier commented on issue #7785: Failure of the Avro Serialization for Complex Types in the Python Client
Posted by GitBox <gi...@apache.org>.
HugoPelletier commented on issue #7785:
URL: https://github.com/apache/pulsar/issues/7785#issuecomment-880169096
@gotopanic If it helps, I'm using version 2.8.0 and we stopped using the AvroSchema: too many issues. We ended up using the JsonSchema. I'm aware that it may not be your objective, but I ended up overriding a few methods from the pulsar-client Python code. Here is what I found:
A. The validation is broken
Both classes (Record and Field) take keyword arguments with defaults: `required=False` (which should be `True` by default, IMO) and `default=None`.
The `validate_type` methods of those classes all ignore these arguments. If you have something like this, it will throw an exception:
```python
from pulsar.schema import Record, String


class MySubRecord(Record):
    a = String()  # required defaults to False
    b = String(required=True)


class MyRecord(Record):
    c = MySubRecord()
    d = String()
```
I ended up overriding the Record class like this:
```python
from typing import Optional

from pulsar.schema import Record as PulsarRecord


class Record(PulsarRecord):
    """
    Override the Pulsar Record class.
    The current implementation doesn't allow a sub-record to be None.
    """

    def __eq__(self, other: PulsarRecord) -> bool:
        """
        Pulsar fails to validate the Array type.
        :param other: Pulsar Record object expected
        :return: If all the sub-records are valid, return True.
        """
        for field in self._fields:
            if isinstance(self.__getattribute__(field), list):
                return all([True for x in range(len(self.__getattribute__(field)))
                            if [self.__getattribute__('sub')[x].__dict__] == other.__getattribute__(field)])
            elif self.__getattribute__(field) != other.__getattribute__(field):
                return False
        return True

    def validate_type(self, name: str, val: PulsarRecord) -> Optional[PulsarRecord]:
        """
        Override the validation. The Pulsar client does not take into consideration the default value in the
        Pulsar.Record constructor and still validates whether the value matches the expected type.
        Here, if required is set to False, we ignore it and return None.
        :param name: name of the parameter
        :param val: value. Can be None or an instance of a class that extends the Pulsar.Record class
        :return: value
        :exception: TypeError
        """
        if self._required:
            return super().validate_type(name, val)
        return None
```
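To make the expected behaviour concrete, here is a small sketch of a validator that honours `required` and `default`. The `Field` class and its `validate` method are hypothetical stand-ins written for this example, not the `pulsar.schema` classes. Unlike the override above, this variant passes a supplied value through even when the field is optional.

```python
class Field:
    """Hypothetical stand-in for a schema field; not the pulsar.schema class."""

    def __init__(self, expected_type, required=False, default=None):
        self.expected_type = expected_type
        self.required = required
        self.default = default

    def validate(self, name, value):
        if value is None:
            # A missing value is only an error when the field is required.
            if self.required:
                raise TypeError(f"Field '{name}' is required")
            return self.default
        if not isinstance(value, self.expected_type):
            raise TypeError(
                f"Invalid type '{type(value).__name__}' for field '{name}'. "
                f"Expected: {self.expected_type.__name__}"
            )
        return value


optional_field = Field(str)                 # required defaults to False
required_field = Field(str, required=True)

print(optional_field.validate("a", None))   # falls back to the default
print(required_field.validate("b", "ok"))   # value is passed through
```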
[GitHub] [pulsar] HugoPelletier commented on issue #7785: Failure of the Avro Serialization for Complex Types in the Python Client
Posted by GitBox <gi...@apache.org>.
HugoPelletier commented on issue #7785:
URL: https://github.com/apache/pulsar/issues/7785#issuecomment-881512787
We are in the same situation.
Serialization/deserialization doesn't work with complex schemas.
Take a super simple "complex" schema like this:
```python
class MySubRecord(Record):
    x = Integer()
    y = Long()
    z = String()


class Example2(Record):
    a = String()
    sub = MySubRecord()
```
I get the following error:
```text
Invalid type '<class 'dict'>' for sub-record field 'sub'. Expected: <class 'tests.mocks.schemas.MySubRecord'>
```
Basically, the Pulsar Python client is unable to rebuild the sub-record from the serialized bytes:
```text
b'{\n "a": "1752767a-72be-4945-b1ca-8b0b565c694a",\n "sub": {\n "_required_default": false,\n "_default": null,\n "_required": false,\n "x": 1,\n "y": 2,\n "z": "test"\n }\n}'
```
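The problem is [here](https://github.com/apache/pulsar/blob/master/pulsar-client-cpp/python/pulsar/schema/schema.py#L92). The dump is done at the parent level, so the sub-record is written out as a plain dict, internal attributes included, and decoding does not turn it back into a `MySubRecord`.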
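As a rough sketch of what decoding would need to do, the function below rebuilds nested records from a parsed JSON dict. Everything here is hypothetical: the classes are plain Python stand-ins, and the `_nested` mapping from field name to sub-record class is an assumption made for this example (the real client would have to derive it from the `Record` field declarations).

```python
import json


class MySubRecord:
    def __init__(self, x=None, y=None, z=None):
        self.x, self.y, self.z = x, y, z


class Example2:
    # Hypothetical: maps a field name to the class of its sub-record.
    _nested = {"sub": MySubRecord}

    def __init__(self, a=None, sub=None):
        self.a, self.sub = a, sub


def rebuild(cls, data):
    """Build an instance of cls from a dict, recursing into sub-records."""
    nested = getattr(cls, "_nested", {})
    kwargs = {}
    for key, value in data.items():
        if key.startswith("_"):
            continue  # drop internal attributes such as _required
        sub_cls = nested.get(key)
        if sub_cls is not None and isinstance(value, dict):
            value = rebuild(sub_cls, value)
        kwargs[key] = value
    return cls(**kwargs)


payload = (
    b'{"a": "id-1", "sub": {"_required_default": false, "_default": null, '
    b'"_required": false, "x": 1, "y": 2, "z": "test"}}'
)
obj = rebuild(Example2, json.loads(payload))
print(type(obj.sub).__name__, obj.sub.x, obj.sub.y, obj.sub.z)
```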
I did a test with a package named [jsonpickle](https://jsonpickle.readthedocs.io/en/latest/) and it solved the issue.
Serialization with this package embeds a reference to the Python class in the payload. If your codebase is only in Python, that could be a solution. In our case, we cannot use this solution since the data sent to Pulsar changes to something like:
```text
b'{"py/object": "tests.mocks.schemas.Example2", "a": "85ece176-f973-4acb-ba80-78d6bb1eeaf8", "sub": {"py/object": "tests.mocks.schemas.MySubRecord", "_required_default": false, "_default": null, "_required": false, "x": 1, "y": 2, "z": "test"}}'
```
We have a mix of Java, C# and Python, so this is impossible because of the `py/object` keys.
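For comparison, a language-neutral payload can be produced with the standard `json` module alone. The sketch below is only an illustration under assumptions: the classes are plain stand-ins rather than `pulsar.schema` records, and `record_default` and `encode` are hypothetical names. It covers encoding only, so decoding would still need to rebuild the sub-records.

```python
import json


def record_default(obj):
    """json.dumps hook: serialize an object as its public attributes only."""
    if hasattr(obj, "__dict__"):
        return {k: v for k, v in vars(obj).items() if not k.startswith("_")}
    raise TypeError(f"Cannot serialize {type(obj).__name__}")


def encode(obj):
    # json.dumps calls record_default for every object it cannot serialize,
    # including nested ones, so sub-records are handled too.
    return json.dumps(obj, default=record_default).encode("utf-8")


class MySubRecord:
    def __init__(self, x, y, z):
        self._required = False  # mimics an internal bookkeeping attribute
        self.x, self.y, self.z = x, y, z


class Example2:
    def __init__(self, a, sub):
        self._required = False
        self.a, self.sub = a, sub


payload = encode(Example2("id-1", MySubRecord(1, 2, "test")))
print(payload)
```

The output contains neither `py/object` nor the underscore-prefixed attributes, so consumers in other languages see only the declared fields.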
Here is the jsonpickle-based code snippet to try:
```python
import enum

import jsonpickle
from pulsar.schema import JsonSchema


class MyJsonSchema(JsonSchema):
    def __init__(self, record_cls):
        super().__init__(record_cls)

    def _get_serialized_value(self, o):
        if isinstance(o, enum.Enum):
            return o.value
        else:
            return o.__dict__

    def encode(self, obj):
        self._validate_object_type(obj)
        # Drop the Record bookkeeping attributes of the top-level object before pickling.
        del obj.__dict__['_default']
        del obj.__dict__['_required']
        del obj.__dict__['_required_default']
        return jsonpickle.encode(obj).encode('utf-8')

    def decode(self, data):
        return jsonpickle.decode(data)
```
[GitHub] [pulsar] gotopanic commented on issue #7785: Failure of the Avro Serialization for Complex Types in the Python Client
Posted by GitBox <gi...@apache.org>.
gotopanic commented on issue #7785:
URL: https://github.com/apache/pulsar/issues/7785#issuecomment-880054445
No update from my side. I ended up not using complex types because of this issue.
[GitHub] [pulsar] gotopanic commented on issue #7785: Failure of the Avro Serialization for Complex Types in the Python Client
Posted by GitBox <gi...@apache.org>.
gotopanic commented on issue #7785:
URL: https://github.com/apache/pulsar/issues/7785#issuecomment-880198681
Thank you for the snippets, they may prove useful at some point.
I stuck with the AvroSchema because of the supposed performance gain over JsonSchema. In some places, I am also using the BytesSchema at the expense (or benefit!) of losing the broker-side schema validation.
[GitHub] [pulsar] HugoPelletier commented on issue #7785: Failure of the Avro Serialization for Complex Types in the Python Client
Posted by GitBox <gi...@apache.org>.
HugoPelletier commented on issue #7785:
URL: https://github.com/apache/pulsar/issues/7785#issuecomment-879838789
Does anyone have an update on this?
The code didn't change in 2.8.0, and it still produces the same error message.
[GitHub] [pulsar] codelipenghui closed issue #7785: Failure of the Avro Serialization for Complex Types in the Python Client
Posted by GitBox <gi...@apache.org>.
codelipenghui closed issue #7785:
URL: https://github.com/apache/pulsar/issues/7785