Posted to mapreduce-user@hadoop.apache.org by Simone Leo <si...@crs4.it> on 2015/03/10 16:52:11 UTC
Pydoop 1.0.0-rc2

Hello everyone,

We're happy to announce the 1.0.0-rc2 release of Pydoop
(http://crs4.github.io/pydoop), the non-Streaming Python interface to
Hadoop.  Adding to the simplified installation and new Pythonic API
introduced with 1.0.0-rc1, this rc provides built-in Avro support (for
now, only with Hadoop 2).  By setting a few flags in the submitter and
selecting the new AvroContext as your application's context class, you
can read and write Avro data, transparently manipulating records as
Python dictionaries.  For instance, you could count the favorite colors
in an Avro file of user records, grouped by office, like this:

    export STATS_SCHEMA=$(cat stats.avsc)
    pydoop submit \
      -D pydoop.mapreduce.avro.value.output.schema="${STATS_SCHEMA}" \
      --avro-input v --avro-output v \
      --upload-file-to-cache color_count.py --mrv2 \
      color_count input output
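
The command above reads the output schema from stats.avsc, which is not
included in this message.  As a rough sketch only, a schema matching
the records emitted by the reducer further down (a string "office" and
a map of per-color counts) might be generated as follows.  The
namespace and record name are placeholders chosen for illustration, not
names taken from Pydoop:

```python
import json

# Assumed output schema: one record per office, with a map from color
# name to count.  "example.avro" and "Stats" are placeholder names.
STATS_SCHEMA = {
    "namespace": "example.avro",
    "type": "record",
    "name": "Stats",
    "fields": [
        {"name": "office", "type": "string"},
        {"name": "counts", "type": {"type": "map", "values": "long"}},
    ],
}


def schema_json(schema=STATS_SCHEMA):
    """Serialize the schema dict to the JSON text stored in an .avsc file."""
    return json.dumps(schema, indent=2)


if __name__ == "__main__":
    # Redirect this output to stats.avsc to use it with the command above.
    print(schema_json())
```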

And your Pydoop code would be these few lines:

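    from collections import Counter

    import pydoop.mapreduce.api as api

    class Mapper(api.Mapper):
        def map(self, ctx):
            # ctx.value is the input Avro record, exposed as a dict
            user = ctx.value
            color = user['favorite_color']
            if color is not None:
                ctx.emit(user['office'], Counter({color: 1}))

    class Reducer(api.Reducer):
        def reduce(self, ctx):
            # Add up the per-record Counters for this office
            s = sum(ctx.values, Counter())
            ctx.emit('', {'office': ctx.key, 'counts': s})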
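
To see what the two classes compute without a Hadoop cluster, the same
logic can be tried on plain dictionaries.  This is only a local sketch:
the sample records, the function names and the in-memory grouping that
stands in for the shuffle are all made up for illustration and are not
part of Pydoop:

```python
from collections import Counter, defaultdict

# Hypothetical input, shaped like the user records read by the mapper.
USERS = [
    {"office": "office-1", "favorite_color": "blue"},
    {"office": "office-1", "favorite_color": "blue"},
    {"office": "office-1", "favorite_color": None},
    {"office": "office-2", "favorite_color": "red"},
]


def map_user(user):
    """Mirror Mapper.map: yield (office, Counter) unless the color is missing."""
    color = user["favorite_color"]
    if color is not None:
        yield user["office"], Counter({color: 1})


def reduce_office(office, counters):
    """Mirror Reducer.reduce: add up the Counters collected for one office."""
    total = sum(counters, Counter())
    return {"office": office, "counts": dict(total)}


def run_locally(users):
    # Group mapper output by key, standing in for the Hadoop shuffle.
    grouped = defaultdict(list)
    for user in users:
        for office, counter in map_user(user):
            grouped[office].append(counter)
    return [reduce_office(k, v) for k, v in sorted(grouped.items())]


if __name__ == "__main__":
    for record in run_locally(USERS):
        print(record)
```

Records with no favorite color are skipped by the mapper, so they do
not contribute to any office's counts.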

Any input/output format that exchanges Avro records is supported, 
including the Parquet ones.  For more detailed information, see the docs 
at http://crs4.github.io/pydoop/examples/avro.html

Pydoop is a Python API for Hadoop that allows you to write full-fledged 
MapReduce applications with HDFS access.  Pydoop powers several 
scientific projects at CRS4, including Seal 
(http://biodoop-seal.sourceforge.net), Biodoop-BLAST 
(http://biodoop.sourceforge.net/blast) and VISPA 
(https://github.com/crs4/vispa), as well as successful commercial 
services such as Slacker Radio (http://www.slacker.com).

Please note that this is a release candidate that has not been used in
production yet.  This means, among other things, that you have to add
the "--pre" flag if installing with pip (e.g., "pip install --pre
pydoop").  As usual, we're happy to receive your feedback: please open
an issue on GitHub if you spot a bug or find something that could be
improved.

Links:

   * download: http://pypi.python.org/pypi/pydoop
   * docs: http://crs4.github.io/pydoop
   * git repo: https://github.com/crs4/pydoop
   * paper: http://dx.doi.org/10.1145/1851476.1851594
   * Dr. Dobb's review:
     http://www.drdobbs.com/database/pydoop-writing-hadoop-programs-in-python/240156473

Happy pydooping!

The Pydoop Team

-- 
Simone Leo
Data Fusion - Distributed Computing
CRS4
POLARIS - Building #1
Piscina Manna
I-09010 Pula (CA) - Italy
e-mail: simone.leo@crs4.it
http://www.crs4.it