Posted to user@hadoop.apache.org by Simone Leo <si...@crs4.it> on 2015/01/19 17:44:32 UTC

Pydoop 1.0.0-rc1

Hello everyone,

we're happy to announce a new major release of Pydoop 
(http://crs4.github.io/pydoop), 1.0.0-rc1.  Pydoop has been almost fully 
rewritten to simplify its installation and usage, and to provide a more 
Pythonic MapReduce API:

OLD:
     def reduce(self, context):
         s = 0
         while context.nextValue():
             s += int(context.getInputValue())
         context.emit(context.getInputKey(), str(s))

NEW:
     def reduce(self, context):
         s = sum(context.values)
         context.emit(context.key, s)

Note the implicit iteration and the transparent type conversion.  Don't 
worry though: the old API is still supported for backwards compatibility.
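
To make the difference concrete without a Hadoop cluster at hand, here is a 
small self-contained sketch.  The two context classes are local stand-ins 
written only for this example (they are not Pydoop classes); they just mimic 
the method and attribute names used in the two snippets above, on the 
assumption that the old API hands out values as strings while the new one 
hands them out already converted.

```python
# Stand-in contexts for illustration only: NOT part of Pydoop.
# They imitate the names used in the OLD and NEW reducers above.

class OldStyleContext(object):
    """Mimics explicit iteration: nextValue() / getInputValue()."""

    def __init__(self, key, values):
        self._key = key
        self._values = iter(values)  # assumed to arrive as strings
        self._current = None
        self.emitted = []

    def getInputKey(self):
        return self._key

    def nextValue(self):
        # Advance to the next value; report whether one was available.
        try:
            self._current = next(self._values)
            return True
        except StopIteration:
            return False

    def getInputValue(self):
        return self._current

    def emit(self, key, value):
        self.emitted.append((key, value))


class NewStyleContext(object):
    """Mimics implicit iteration: .key and an iterable .values."""

    def __init__(self, key, values):
        self.key = key
        self.values = iter(values)  # assumed to be already converted
        self.emitted = []

    def emit(self, key, value):
        self.emitted.append((key, value))


def reduce_old(context):
    # Manual loop plus manual str <-> int conversion.
    s = 0
    while context.nextValue():
        s += int(context.getInputValue())
    context.emit(context.getInputKey(), str(s))


def reduce_new(context):
    # The iterable does the looping; no conversion code is needed.
    s = sum(context.values)
    context.emit(context.key, s)


old_ctx = OldStyleContext("word", ["1", "2", "3"])
reduce_old(old_ctx)

new_ctx = NewStyleContext("word", [1, 2, 3])
reduce_new(new_ctx)

print(old_ctx.emitted)  # [('word', '6')]
print(new_ctx.emitted)  # [('word', 6)]
```

Both reducers compute the same sum; the only difference is who does the 
iteration and conversion work, the user code or the context.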

Installing Pydoop is now both simpler and faster.  We have reimplemented 
almost everything in pure Python, thus removing some hard-to-install 
dependencies.  To interface with libhdfs and for performance-critical 
sections, such as serialization, we created standard CPython extensions.

Also, Pydoop now supports installation-free usage 
(http://crs4.github.io/pydoop/self_contained.html), which lets you run 
Pydoop-based programs without first installing Pydoop (or the programs 
themselves) on the Hadoop cluster.  Submitting this and other types of 
job is greatly simplified by the new "pydoop submit" command.

In addition, we now support Hadoop 2 up to 2.6.0 and CDH5. The full news 
list is at http://crs4.github.io/pydoop/news.html.

Please note that this is a release candidate that's not been used in 
production yet.  This means, amongst other things, that you have to add 
the "--pre" flag if installing with pip.  As usual, we're happy to 
receive your feedback: please open an issue on GitHub if you spot a bug 
or find something that could be improved.

Pydoop is a Python API for Hadoop that allows you to write full-fledged 
MapReduce applications with HDFS access.  Pydoop powers several 
scientific projects at CRS4, including Seal 
(http://biodoop-seal.sourceforge.net), Biodoop-BLAST 
(http://biodoop.sourceforge.net/blast) and VISPA 
(https://github.com/crs4/vispa), as well as successful
commercial services such as Slacker Radio (http://www.slacker.com).

Links:

  * download: http://pypi.python.org/pypi/pydoop
  * docs: http://crs4.github.io/pydoop
  * git repo: https://github.com/crs4/pydoop
  * paper: http://dx.doi.org/10.1145/1851476.1851594
  * Dr. Dobb's review: 
http://www.drdobbs.com/database/pydoop-writing-hadoop-programs-in-python/240156473

Happy pydooping!

The Pydoop Team

-- 
Simone Leo
Data Fusion - Distributed Computing
CRS4
POLARIS - Building #1
Piscina Manna
I-09010 Pula (CA) - Italy
e-mail: simone.leo@crs4.it
http://www.crs4.it