Posted to mapreduce-user@hadoop.apache.org by Simone Leo <si...@crs4.it> on 2015/03/10 16:52:11 UTC
Pydoop 1.0.0-rc2
Hello everyone,
we're happy to announce the 1.0.0-rc2 release of Pydoop
(http://crs4.github.io/pydoop), the non-Streaming Python interface to
Hadoop. Adding to the simplified installation and new Pythonic API
introduced with 1.0.0-rc1, this rc provides built-in Avro support (for
now, only with Hadoop 2). By setting a few flags in the submitter and
selecting the new AvroContext as your application's context class, you
can read and write Avro data, transparently manipulating records as
Python dictionaries. For instance, you could count your favorite colors
stored in an Avro file like this:

    export STATS_SCHEMA=$(cat stats.avsc)
    pydoop submit \
      -D pydoop.mapreduce.avro.value.output.schema="${STATS_SCHEMA}" \
      --avro-input v --avro-output v \
      --upload-file-to-cache color_count.py --mrv2 \
      color_count input output

And your Pydoop code would be these few lines:
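
    from collections import Counter
    import pydoop.mapreduce.api as api

    class Mapper(api.Mapper):
        def map(self, ctx):
            # ctx.value is the Avro input record, exposed as a dict
            user = ctx.value
            color = user['favorite_color']
            if color is not None:
                ctx.emit(user['office'], Counter({color: 1}))

    class Reducer(api.Reducer):
        def reduce(self, ctx):
            # merge the per-user counters emitted for this office
            s = sum(ctx.values, Counter())
            ctx.emit('', {'office': ctx.key, 'counts': s})
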
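To see what the mapper and reducer above compute without a Hadoop cluster, the same counting logic can be sketched in plain Python. This is only a local simulation under stated assumptions: the sample records and the helper names (map_user, reduce_office, run_local) are hypothetical and are not part of the Pydoop API.

```python
from collections import Counter, defaultdict

# Hypothetical sample records, shaped like the Avro user records in the example.
USERS = [
    {"office": "office-1", "favorite_color": "blue"},
    {"office": "office-1", "favorite_color": "blue"},
    {"office": "office-1", "favorite_color": None},
    {"office": "office-2", "favorite_color": "red"},
    {"office": "office-1", "favorite_color": "green"},
]


def map_user(user):
    """Yield (office, Counter) pairs, skipping users with no favorite color."""
    color = user["favorite_color"]
    if color is not None:
        yield user["office"], Counter({color: 1})


def reduce_office(office, counters):
    """Merge the per-user Counters into one stats record for the office."""
    total = sum(counters, Counter())
    return {"office": office, "counts": dict(total)}


def run_local(users):
    """Group mapper output by key (the shuffle step), then reduce each group."""
    grouped = defaultdict(list)
    for user in users:
        for key, value in map_user(user):
            grouped[key].append(value)
    return [reduce_office(k, v) for k, v in sorted(grouped.items())]


STATS = run_local(USERS)
```

Summing with an empty Counter() as the start value is what lets sum() merge counters instead of trying to add them to the integer 0.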
Any input/output format that exchanges Avro records is supported,
including the Parquet ones. For more detailed information, see the docs
at http://crs4.github.io/pydoop/examples/avro.html
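The STATS_SCHEMA passed to the submitter has to describe the records the reducer emits. As a rough sketch, a stats.avsc for this example could be generated as follows; the field names come from the reducer's output, while the namespace, the record name and the field types (string, map of long) are assumptions made for illustration.

```python
import json

# Hypothetical Avro schema for the reducer output {'office': ..., 'counts': ...}.
# "namespace", "name" and the field types are assumptions, not taken from the post.
STATS_SCHEMA = {
    "namespace": "example.avro",
    "type": "record",
    "name": "Stats",
    "fields": [
        {"name": "office", "type": "string"},
        {"name": "counts", "type": {"type": "map", "values": "long"}},
    ],
}

# Avro schema files (.avsc) are plain JSON documents.
STATS_AVSC = json.dumps(STATS_SCHEMA, indent=2)
```

Saving STATS_AVSC to a file named stats.avsc would give the file read by the export command shown earlier.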
Pydoop is a Python API for Hadoop that allows you to write full-fledged
MapReduce applications with HDFS access. Pydoop powers several
scientific projects at CRS4, including Seal
(http://biodoop-seal.sourceforge.net), Biodoop-BLAST
(http://biodoop.sourceforge.net/blast) and VISPA
(https://github.com/crs4/vispa), as well as successful commercial
services such as Slacker Radio (http://www.slacker.com).
Please note that this is a release candidate that has not been used in
production yet. Among other things, this means that you have to add
the "--pre" flag when installing with pip (pip install --pre pydoop).
As usual, we're happy to receive your feedback: please open an issue
on GitHub if you spot a bug or find something that could be improved.
Links:
* download: http://pypi.python.org/pypi/pydoop
* docs: http://crs4.github.io/pydoop
* git repo: https://github.com/crs4/pydoop
* paper: http://dx.doi.org/10.1145/1851476.1851594
* Dr. Dobb's review:
  http://www.drdobbs.com/database/pydoop-writing-hadoop-programs-in-python/240156473
Happy pydooping!
The Pydoop Team
--
Simone Leo
Data Fusion - Distributed Computing
CRS4
POLARIS - Building #1
Piscina Manna
I-09010 Pula (CA) - Italy
e-mail: simone.leo@crs4.it
http://www.crs4.it