Posted to dev@crunch.apache.org by "Kiyan Ahmadizadeh (JIRA)" <ji...@apache.org> on 2012/07/10 03:24:33 UTC
[jira] [Updated] (CRUNCH-9) Add support for launching Scrunch pipelines from a REPL
[ https://issues.apache.org/jira/browse/CRUNCH-9?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Kiyan Ahmadizadeh updated CRUNCH-9:
-----------------------------------
Attachment: CRUNCH-9.patch
This commit modifies the Scrunch project so that Scrunch jobs can be run from
a Scala REPL. To start a REPL capable of launching Scrunch jobs, build Scrunch
with `mvn package` and run bin/scrunch from the resulting distribution
directory. Several changes have been made to the project to accomplish this:
1. The project has been modified to produce a release distribution. Maven
creates the distribution, as both a folder and a tarball, when `mvn package` is
run. The distribution folder contains a bin dir with scripts, a lib dir with
all library jars, and a log dir with a log4j configuration file.
2. A modified Scala REPL was added to the project. A new object,
InterpreterRunner, launches the REPL; it is a modification of Scala's
MainGenericRunner. The Scrunch version allows client code to determine whether
a REPL is actually running, and includes methods for creating a jar from the
code compiled from REPL input. A script named "scrunch" was added to the
project that, when run, launches this modified Scala REPL. The script is a
modification of the script distributed with Scala that launches the Scala REPL.
3. Scrunch's Pipeline class was modified so that any MapReduce pipeline
constructed automatically adds the Scrunch lib jars to the job's Distributed
Cache and to the classpath of the job's tasks.
4. Methods on PCollection/PTable/etc. that launch a job were modified to check
whether the REPL is running and, if so, create a jar of the code compiled from
REPL input and ship that jar with the job so that it's on the classpath of the
job's tasks.
5. To facilitate extensions, the From/To/At objects were changed to traits,
each with an identically named singleton object that extends the trait.
6. The examples in the examples directory, and the script scrunch.py for running
those examples, are included in the project distribution. The scrunch.py script
was renamed to scrunch-job.py and modified to cope with the new project
distribution structure and to take advantage of the fact that Scrunch lib jars
are now automatically added to the classpath of launched jobs.
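To illustrate the pattern in item 5, here is a minimal sketch of a trait paired
with a singleton object of the same name. The method names and string return
values are hypothetical stand-ins for illustration only; they are not the
actual Scrunch From/To/At API.

```scala
// Hypothetical stand-in for a Scrunch source factory; not the real API.
trait From {
  def textFile(path: String): String = "text:" + path
}

// A singleton with the same name keeps call sites such as From.textFile(...)
// working exactly as they did when From was a plain object.
object From extends From

// An extension mixes in the trait and adds its own (hypothetical) source.
object MyFrom extends From {
  def gzipTextFile(path: String): String = "gzip:" + textFile(path)
}

object TraitDemo {
  def main(args: Array[String]): Unit = {
    println(From.textFile("/in"))
    println(MyFrom.gzipTextFile("/in"))
  }
}
```

Because the behaviour lives in the trait, client code can define its own
object that inherits every existing method and adds new ones, which would not
be possible if From were only a singleton object.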
I started an integration test that actually launches jobs, but the MiniMRCluster
testing framework does not behave properly when jars are added to the
distributed cache. The problem is related to MAPREDUCE-2884. I have verified
on an actual cluster that jobs can be launched from the REPL.
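For readers curious about the jar-creation step mentioned in items 2 and 4,
the following is a rough sketch of packing a directory of compiled classes
into a jar using only the JDK's java.util.jar classes. It is not the code in
the patch: the names ReplJar and jarDirectory are hypothetical, and it assumes
the REPL's compiled classes have been written to a real directory on disk.

```scala
import java.io.{File, FileInputStream, FileOutputStream}
import java.util.jar.{JarEntry, JarOutputStream}

// Hypothetical helper; names are illustrative and not taken from the patch.
object ReplJar {
  /** Packs every file under classDir into jarFile, keeping relative paths. */
  def jarDirectory(classDir: File, jarFile: File): Unit = {
    val out = new JarOutputStream(new FileOutputStream(jarFile))
    try {
      // Recursively add f; prefix is its parent path inside the jar.
      def add(f: File, prefix: String): Unit = {
        if (f.isDirectory) {
          val children = Option(f.listFiles).getOrElse(Array.empty[File])
          children.foreach(c => add(c, prefix + f.getName + "/"))
        } else {
          out.putNextEntry(new JarEntry(prefix + f.getName))
          val in = new FileInputStream(f)
          try {
            val buf = new Array[Byte](8192)
            var n = in.read(buf)
            while (n != -1) {
              out.write(buf, 0, n)
              n = in.read(buf)
            }
          } finally in.close()
          out.closeEntry()
        }
      }
      // Start from the children so the root directory name is not an entry.
      Option(classDir.listFiles).getOrElse(Array.empty[File])
        .foreach(c => add(c, ""))
    } finally out.close()
  }
}
```

A jar produced this way could then be shipped with the job so that the
classes defined at the REPL prompt are available to the job's tasks.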
> Add support for launching Scrunch pipelines from a REPL
> -------------------------------------------------------
>
> Key: CRUNCH-9
> URL: https://issues.apache.org/jira/browse/CRUNCH-9
> Project: Crunch
> Issue Type: New Feature
> Components: Scrunch
> Reporter: Josh Wills
> Attachments: CRUNCH-9.patch
>
>
> It would be really, really cool and useful to be able to launch a Scrunch pipeline from a Scala-based REPL, which was one of the killer apps for Cascade, Google's Scala-based wrapper around FlumeJava.
> See the video from Scala Days 2011 for a reference: http://days2011.scala-lang.org/node/138/282