Posted to issues@beam.apache.org by "Valentyn Tymofieiev (Jira)" <ji...@apache.org> on 2020/08/20 17:25:00 UTC

[jira] [Commented] (BEAM-10769) Fix Avro IO documentation: when fastavro is used, do not pass schema parsed by avro-python3.

    [ https://issues.apache.org/jira/browse/BEAM-10769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17181348#comment-17181348 ] 

Valentyn Tymofieiev commented on BEAM-10769:
--------------------------------------------

Beam switched to fastavro as the default Avro library on Python 3. The fastavro-based Avro sink expects the schema as a dictionary, while the avro-python3-based Avro sink expects a schema that was previously parsed by avro.schema.Parse(). fastavro does not accept a schema parsed by avro-python3.

When a user moves a pipeline that uses the WriteToAvro transform to Python 3 without changing how the schema is passed, the transform still receives a schema parsed by avro.schema.Parse(). fastavro cannot parse that object, because it expects the schema as a dictionary. fastavro does not require a pre-parsed schema, although supplying a schema parsed by fastavro also works.

The error may manifest as follows:

{noformat}
...lib/python3.7/site-packages/apache_beam/io/avroio.py", line 634, in open
    return Writer(file_handle, self._schema, self._codec)
  File "fastavro/_write.pyx", line 522, in fastavro._write.Writer.__init__
  File "fastavro/_schema.pyx", line 71, in fastavro._schema.parse_schema
  File "fastavro/_schema.pyx", line 85, in fastavro._schema._parse_schema
TypeError: unhashable type: 'RecordSchema' [while running 'SampleInfoToAvro/WriteToAvroFiles/Write/WriteImpl/WriteBundles']
{noformat}

To fix the error, users should pass the schema to the sink as a dictionary. https://github.com/apache/beam/pull/12638 is out to fix the documentation and to catch these errors with a clearer error message.
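
As a rough illustration of the fix, the sketch below loads an Avro schema from JSON into a plain dictionary and passes that dictionary to WriteToAvro, instead of the object returned by avro.schema.Parse(). The schema contents, the write_users helper, and the output path prefix are made up for illustration; only the pattern of passing a dictionary comes from this issue.

```python
import json

# Hypothetical schema used only for illustration.
SCHEMA_JSON = """
{
  "namespace": "example.avro",
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "favorite_number", "type": ["int", "null"]}
  ]
}
"""

# With the fastavro-based sink, pass the schema as a plain dictionary.
# Do NOT pass the result of avro.schema.Parse(SCHEMA_JSON) here.
schema_dict = json.loads(SCHEMA_JSON)


def write_users(records, file_path_prefix):
    """Hypothetical helper: writes dict records to Avro files with Beam."""
    # Imported lazily so the schema handling above runs without Beam installed.
    import apache_beam as beam
    from apache_beam.io.avroio import WriteToAvro

    with beam.Pipeline() as pipeline:
        (
            pipeline
            | "CreateRecords" >> beam.Create(records)
            | "WriteToAvroFiles" >> WriteToAvro(file_path_prefix, schema=schema_dict)
        )
```

A call such as write_users([{"name": "Ada", "favorite_number": 7}], "/tmp/users") would then run the pipeline, assuming apache_beam and fastavro are installed.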

> Fix Avro IO documentation: when fastavro is used, do not pass schema parsed by avro-python3.
> --------------------------------------------------------------------------------------------
>
>                 Key: BEAM-10769
>                 URL: https://issues.apache.org/jira/browse/BEAM-10769
>             Project: Beam
>          Issue Type: Bug
>          Components: io-py-gcp
>            Reporter: Valentyn Tymofieiev
>            Assignee: Valentyn Tymofieiev
>            Priority: P2
>          Time Spent: 1h 20m
>  Remaining Estimate: 0h
>



