You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@avro.apache.org by fo...@apache.org on 2019/07/04 09:36:48 UTC
[avro] branch master updated: AVRO-2441: Add quickstart guide for
Python3 (#558)
This is an automated email from the ASF dual-hosted git repository.
fokko pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/avro.git
The following commit(s) were added to refs/heads/master by this push:
new 9beb79d AVRO-2441: Add quickstart guide for Python3 (#558)
9beb79d is described below
commit 9beb79def587e4ee7056a91cc3e21d6392b0ff94
Author: Kengo Seki <se...@apache.org>
AuthorDate: Thu Jul 4 18:36:43 2019 +0900
AVRO-2441: Add quickstart guide for Python3 (#558)
* AVRO-2441: Add quickstart guide for Python3
* AVRO-2441: Add quickstart guide for Python3
Add a description about installation with pip
* AVRO-2441: Add quickstart guide for Python3
Remove an unnecessary comment
* AVRO-2441: Add quickstart guide for Python3
Make the installation commands consistent to install
Avro onto the current user's local directory.
---
doc/src/content/xdocs/gettingstartedpython3.xml | 235 ++++++++++++++++++++++++
doc/src/content/xdocs/site.xml | 3 +-
2 files changed, 237 insertions(+), 1 deletion(-)
diff --git a/doc/src/content/xdocs/gettingstartedpython3.xml b/doc/src/content/xdocs/gettingstartedpython3.xml
new file mode 100644
index 0000000..46ededa
--- /dev/null
+++ b/doc/src/content/xdocs/gettingstartedpython3.xml
@@ -0,0 +1,235 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements. See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License. You may obtain a copy of the License at
+
+ https://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+ -->
+<!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V2.0//EN"
+ "https://forrest.apache.org/dtd/document-v20.dtd" [
+ <!ENTITY % avro-entities PUBLIC "-//Apache//ENTITIES Avro//EN"
+ "../../../../build/avro.ent">
+ %avro-entities;
+]>
+<document>
+ <header>
+ <title>Apache Avro™ &AvroVersion; Getting Started (Python3)</title>
+ </header>
+ <body>
+ <p>
+ This is a short guide for getting started with Apache Avro™ using
+ Python3. This guide only covers using Avro for data serialization; see
+ Patrick Hunt's <a href="https://github.com/phunt/avro-rpc-quickstart">Avro
+ RPC Quick Start</a> for a good introduction to using Avro for RPC.
+ </p>
+
+ <section id="download_install">
+ <title>Download</title>
+ <p>
+ Avro implementations for C, C++, C#, Java, PHP, Python, and Ruby can be
+ downloaded from the <a
+ href="https://avro.apache.org/releases.html">Apache Avro™
+ Releases</a> page. This guide uses Avro &AvroVersion;, the latest
+ version at the time of writing. Download and unzip
+ <em>avro-python3-&AvroVersion;.tar.gz</em>, and install via <code>python3
+ setup.py</code>. Ensure that you can <code>import avro</code> from a
+ Python prompt.
+ </p>
+ <source>
+$ tar xvf avro-python3-&AvroVersion;.tar.gz
+$ cd avro-python3-&AvroVersion;
+$ python3 setup.py install --user
+$ python3
+>>> import avro # should not raise ImportError
+ </source>
+ <p>
+ Or, it's published as <a href="https://pypi.org/project/avro-python3/">
+ the avro-python3 package</a> on <a href="https://pypi.org/">PyPI</a>,
+ so you can also use pip for installation:
+ </p>
+ <source>
+$ pip3 install avro-python3
+$ python3
+>>> import avro # should not raise ImportError
+ </source>
+ <p>
+ Alternatively, you may build the Avro Python library from source. From
+ your the root Avro directory, run the commands
+ </p>
+ <source>
+$ cd lang/py3/
+$ python3 setup.py install --user
+$ python3
+>>> import avro # should not raise ImportError
+ </source>
+ </section>
+
+ <section>
+ <title>Defining a schema</title>
+ <p>
+ Avro schemas are defined using JSON. Schemas are composed of <a
+ href="spec.html#schema_primitive">primitive types</a>
+ (<code>null</code>, <code>boolean</code>, <code>int</code>,
+ <code>long</code>, <code>float</code>, <code>double</code>,
+ <code>bytes</code>, and <code>string</code>) and <a
+ href="spec.html#schema_complex">complex types</a> (<code>record</code>,
+ <code>enum</code>, <code>array</code>, <code>map</code>,
+ <code>union</code>, and <code>fixed</code>). You can learn more about
+ Avro schemas and types from the specification, but for now let's start
+ with a simple schema example, <em>user.avsc</em>:
+ </p>
+ <source>
+{"namespace": "example.avro",
+ "type": "record",
+ "name": "User",
+ "fields": [
+ {"name": "name", "type": "string"},
+ {"name": "favorite_number", "type": ["int", "null"]},
+ {"name": "favorite_color", "type": ["string", "null"]}
+ ]
+}
+ </source>
+ <p>
+ This schema defines a record representing a hypothetical user. (Note
+ that a schema file can only contain a single schema definition.) At
+ minimum, a record definition must include its type (<code>"type":
+ "record"</code>), a name (<code>"name": "User"</code>), and fields, in
+ this case <code>name</code>, <code>favorite_number</code>, and
+ <code>favorite_color</code>. We also define a namespace
+ (<code>"namespace": "example.avro"</code>), which together with the name
+ attribute defines the "full name" of the schema
+ (<code>example.avro.User</code> in this case).
+
+ </p>
+ <p>
+ Fields are defined via an array of objects, each of which defines a name
+ and type (other attributes are optional, see the <a
+ href="spec.html#schema_record">record specification</a> for more
+ details). The type attribute of a field is another schema object, which
+ can be either a primitive or complex type. For example, the
+ <code>name</code> field of our User schema is the primitive type
+ <code>string</code>, whereas the <code>favorite_number</code> and
+ <code>favorite_color</code> fields are both <code>union</code>s,
+ represented by JSON arrays. <code>union</code>s are a complex type that
+ can be any of the types listed in the array; e.g.,
+ <code>favorite_number</code> can either be an <code>int</code> or
+ <code>null</code>, essentially making it an optional field.
+ </p>
+ </section>
+
+ <section>
+ <title>Serializing and deserializing without code generation</title>
+ <p>
+ Data in Avro is always stored with its corresponding schema, meaning we
+ can always read a serialized item, regardless of whether we know the
+ schema ahead of time. This allows us to perform serialization and
+ deserialization without code generation. Note that the Avro Python
+ library does not support code generation.
+ </p>
+ <p>
+ Try running the following code snippet, which serializes two users to a
+ data file on disk, and then reads back and deserializes the data file:
+ </p>
+ <source>
+import avro.schema
+from avro.datafile import DataFileReader, DataFileWriter
+from avro.io import DatumReader, DatumWriter
+
+schema = avro.schema.Parse(open("user.avsc", "rb").read())
+
+writer = DataFileWriter(open("users.avro", "wb"), DatumWriter(), schema)
+writer.append({"name": "Alyssa", "favorite_number": 256})
+writer.append({"name": "Ben", "favorite_number": 7, "favorite_color": "red"})
+writer.close()
+
+reader = DataFileReader(open("users.avro", "rb"), DatumReader())
+for user in reader:
+ print(user)
+reader.close()
+ </source>
+ <p>This outputs:</p>
+ <source>
+{u'favorite_color': None, u'favorite_number': 256, u'name': u'Alyssa'}
+{u'favorite_color': u'red', u'favorite_number': 7, u'name': u'Ben'}
+ </source>
+ <p>
+ Do make sure that you open your files in binary mode (i.e. using the modes
+ <code>wb</code> or <code>rb</code> respectively). Otherwise you might
+ generate corrupt files due to
+ <a href="https://docs.python.org/library/functions.html#open">
+ automatic replacement</a> of newline characters with the
+ platform-specific representations.
+ </p>
+ <p>
+ Let's take a closer look at what's going on here.
+ </p>
+ <source>
+schema = avro.schema.Parse(open("user.avsc", "rb").read())
+ </source>
+ <p>
+ <code>avro.schema.Parse</code> takes a string containing a JSON schema
+ definition as input and outputs a <code>avro.schema.Schema</code> object
+ (specifically a subclass of <code>Schema</code>, in this case
+ <code>RecordSchema</code>). We're passing in the contents of our
+ user.avsc schema file here.
+ </p>
+ <source>
+writer = DataFileWriter(open("users.avro", "wb"), DatumWriter(), schema)
+ </source>
+ <p>
+ We create a <code>DataFileWriter</code>, which we'll use to write
+ serialized items to a data file on disk. The
+ <code>DataFileWriter</code> constructor takes three arguments:
+ </p>
+ <ul>
+ <li>The file we'll serialize to</li>
+ <li>A <code>DatumWriter</code>, which is responsible for actually
+ serializing the items to Avro's binary format.</li>
+ <li>The schema we're using. The <code>DataFileWriter</code> needs the
+ schema both to write the schema to the data file, and to verify that
+ the items we write are valid items and write the appropriate
+ fields.</li>
+ </ul>
+ <source>
+writer.append({"name": "Alyssa", "favorite_number": 256})
+writer.append({"name": "Ben", "favorite_number": 7, "favorite_color": "red"})
+ </source>
+ <p>
+ We use <code>DataFileWriter.append</code> to add items to our data
+ file. Avro records are represented as Python <code>dict</code>s.
+ Since the field <code>favorite_color</code> has type <code>["int",
+ "null"]</code>, we are not required to specify this field, as shown in
+ the first append. Were we to omit the required <code>name</code>
+ field, an exception would be raised. Any extra entries not
+ corresponding to a field are present in the <code>dict</code> are
+ ignored.
+ </p>
+ <source>
+reader = DataFileReader(open("users.avro", "rb"), DatumReader())
+ </source>
+ <p>
+ We open the file again, this time for reading back from disk. We use
+ a <code>DataFileReader</code> and <code>DatumReader</code> analagous
+ to the <code>DataFileWriter</code> and <code>DatumWriter</code> above.
+ </p>
+ <source>
+for user in reader:
+ print(user)
+ </source>
+ <p>
+ The <code>DataFileReader</code> is an iterator that returns
+ <code>dict</code>s corresponding to the serialized items.
+ </p>
+ </section>
+ </body>
+</document>
diff --git a/doc/src/content/xdocs/site.xml b/doc/src/content/xdocs/site.xml
index e77517a..6f86841 100644
--- a/doc/src/content/xdocs/site.xml
+++ b/doc/src/content/xdocs/site.xml
@@ -42,7 +42,8 @@ See https://forrest.apache.org/docs/linking.html for more info
<docs label="Documentation">
<overview label="Overview" href="index.html" />
<gettingstartedjava label="Getting started (Java)" href="gettingstartedjava.html" />
- <gettingstartedpython label="Getting started (Python)" href="gettingstartedpython.html" />
+ <gettingstartedpython label="Getting started (Python2)" href="gettingstartedpython.html" />
+ <gettingstartedpython label="Getting started (Python3)" href="gettingstartedpython3.html" />
<spec label="Specification" href="spec.html" />
<trevni label="Trevni" href="ext:trevni/spec" />
<java-api label="Java API" href="ext:api/java/index" />