Posted to user@crunch.apache.org by Daithi O Crualaoich <da...@guardian.co.uk> on 2013/01/02 20:53:41 UTC

Re: Crunch with Elastic MapReduce

On 14/08/12 23:55, Shawn Smith wrote:
> Has anyone tried using Crunch with Amazon Elastic MapReduce?  I've run
> into a few issues, and I thought I'd share my experiences so far:

Thank you. You made this much easier for me.


> 1. A typical Elastic MapReduce job uses S3 input and output files
> (w/Amazon's customized Native S3 File System) and HDFS intermediate
> files.  This doesn't work with Crunch calls to
> FileSystem.get(Configuration) that assume the default file system
> (HDFS).  Example stack trace:
>
>     Exception in thread "main" java.lang.IllegalArgumentException: This
>     file system object (hdfs://10.114.37.65:9000) does not support
>     access to the request path 's3://test-bucket/test/Input.avro' You
>     possibly called FileSystem.get(conf) when you should have called
>     FileSystem.get(uri, conf) to obtain a file system supporting your path.
>
>     at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:381)
>     at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:129)
>     at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:513)
>     at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:767)
>     at org.apache.crunch.io.SourceTargetHelper.getPathSize(SourceTargetHelper.java:44)
>
> It looks like switching to Path.getFileSystem(Configuration) throughout
> allows mixing S3 and HDFS files.

There's another instance of this FileSystem.get(Configuration) problem
in https://issues.apache.org/jira/browse/CRUNCH-138, and the fix for it
has just been merged.


> 3. EMR Hadoop 1.0.3 includes Avro 1.5.3 which apparently takes
> precedence over Crunch's Avro 1.7.0.  I didn't mess around with trying
> to get my classes in the class path first…  Instead I used the
> maven-shade-plugin in my job's build to shade Avro 1.7.0 from
> "org.apache.avro.*" to "shaded.org.apache.avro.*" so it wouldn't
> conflict with the EMR version of Avro.  Example exception (you can see
> the Avro source code line numbers correspond to version 1.5.3):
>
>     2012-08-13 06:50:57,547 WARN org.apache.hadoop.mapred.Child (main):
>     Error running child
>     java.lang.RuntimeException: java.lang.NoSuchMethodException:
>     org.apache.avro.mapred.Pair.<init>()
>     at
>     org.apache.avro.specific.SpecificDatumReader.newInstance(SpecificDatumReader.java:101)
>     at
>     org.apache.avro.specific.SpecificDatumReader.newRecord(SpecificDatumReader.java:56)
>

This kind of shading is difficult to do using SBT for a Scrunch project.
Messing around with the Hadoop classpath variables in AWS EMR
bootstrap actions is no fun either. Instead, a quick and nasty hack is
to remove the conflicting Avro jar from the Hadoop installation on each
node using a bootstrap action:

    #!/bin/bash
    # Remove Avro 1.5.3 from Amazon Hadoop 1.0.3 to fix Crunch conflict.
    rm /home/hadoop/lib/avro-1.5.3.jar
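
A slightly more defensive variant of the same hack might look like the
sketch below. It is only an illustration: the function name and the
HADOOP_LIB_DIR and AVRO_VERSION variables are hypothetical overrides I
made up for the example, not EMR settings, and the default path and
version are just the ones from the script above. The point is that the
script does not fail if the jar is already gone or the image has changed.

```shell
#!/bin/bash
# Sketch of a more defensive bootstrap action (assumptions noted inline).
set -eu

# Remove a conflicting Avro jar from a Hadoop lib directory.
#   $1: lib directory (default: the path used in the original script)
#   $2: Avro version to remove (default: 1.5.3, as on EMR Hadoop 1.0.3)
remove_conflicting_avro() {
    local lib_dir="${1:-/home/hadoop/lib}"
    local version="${2:-1.5.3}"
    local jar="${lib_dir}/avro-${version}.jar"

    if [ -f "$jar" ]; then
        rm -f -- "$jar"
        echo "Removed $jar"
    else
        # Not treated as an error: the jar may already be absent.
        echo "No $jar found, nothing to do"
    fi
}

# HADOOP_LIB_DIR and AVRO_VERSION are hypothetical overrides, not EMR settings.
remove_conflicting_avro "${HADOOP_LIB_DIR:-/home/hadoop/lib}" "${AVRO_VERSION:-1.5.3}"
```

Only the jar matching the given version is touched, so a newer Avro jar
sitting in the same directory is left alone.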



Daithi