Posted to user@spark.apache.org by Paul Wais <pw...@yelp.com> on 2014/09/16 03:28:41 UTC

Spark 1.1 / cdh4 stuck using old hadoop client?

Dear List,

I'm having trouble getting Spark 1.1 to use the Hadoop 2 API for
reading SequenceFiles.  In particular, I'm seeing:

Exception in thread "main" org.apache.hadoop.ipc.RemoteException:
Server IPC version 7 cannot communicate with client version 4
        at org.apache.hadoop.ipc.Client.call(Client.java:1070)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
        at com.sun.proxy.$Proxy7.getProtocolVersion(Unknown Source)
        ...

The error occurs when invoking JavaSparkContext#newAPIHadoopFile() with
args (validSequenceFileURI, SequenceFileInputFormat.class, Text.class,
BytesWritable.class, new Job().getConfiguration()) -- pretty close to
the unit test here:
https://github.com/apache/spark/blob/f0f1ba09b195f23f0c89af6fa040c9e01dfa8951/core/src/test/java/org/apache/spark/JavaAPISuite.java#L916
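
For concreteness, here's roughly what my call looks like (a sketch; the
HDFS URI and app name below are placeholders, not my real values):

    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class ReadSeqFile {
      public static void main(String[] args) throws Exception {
        JavaSparkContext sc =
            new JavaSparkContext(new SparkConf().setAppName("read-seq"));

        // Read a SequenceFile<Text, BytesWritable> through the "new"
        // (org.apache.hadoop.mapreduce) input format API.
        JavaPairRDD<Text, BytesWritable> rdd = sc.newAPIHadoopFile(
            "hdfs://namenode:8020/path/to/data.seq",  // placeholder URI
            SequenceFileInputFormat.class,
            Text.class,
            BytesWritable.class,
            new Job().getConfiguration());

        System.out.println("records: " + rdd.count());
        sc.stop();
      }
    }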


This error indicates to me that Spark is using an old hadoop client to
do reads.  Oddly, I'm able to do /writes/ OK, i.e. I can write
via JavaPairRDD#saveAsNewAPIHadoopFile() to my hdfs cluster.


Do I need to explicitly build spark for modern hadoop??  I previously
had an hdfs cluster running hadoop 2.3.0 and I was getting a similar
error (server is using version 9, client is using version 4).


I'm using Spark 1.1 cdh4 as well as hadoop cdh4 from the links posted
on spark's site:
 * http://d3kbcqa49mib13.cloudfront.net/spark-1.1.0-bin-cdh4.tgz
 * http://d3kbcqa49mib13.cloudfront.net/hadoop-2.0.0-cdh4.2.0.tar.gz


What distro of Hadoop is used at Databricks?  Are there distros of
Spark 1.1 and hadoop that should work together out-of-the-box?
(Previously I had Spark 1.0.0 and Hadoop 2.3 working fine..)

Thanks for any help anybody can give me here!
-Paul

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: Spark 1.1 / cdh4 stuck using old hadoop client?

Posted by Paul Wais <pw...@yelp.com>.
Hi Sean,

Great catch! Yes I was including Spark as a dependency and it was
making its way into my uber jar.  Following the advice I just found at
Stack Overflow [1], I marked Spark as a provided dependency and that
appeared to fix my Hadoop client issue.  Thanks for your help!!!
Perhaps the maintainers might consider setting this in the Quickstart
guide pom.xml ( http://spark.apache.org/docs/latest/quick-start.html ).
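
For anyone following along, here's a sketch of the kind of dependency
section that does the trick (the versions and artifact ids below are
illustrative; match them to your own cluster and Spark build):

    <dependencies>
      <!-- Spark is supplied by the cluster at runtime, so keep it out
           of the uber jar -->
      <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.10</artifactId>
        <version>1.1.0</version>
        <scope>provided</scope>
      </dependency>
      <!-- Only needed if the app uses Hadoop APIs directly; match the
           cluster's version and mark it provided as well -->
      <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>2.3.0</version>
        <scope>provided</scope>
      </dependency>
    </dependencies>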

In summary, here's what worked:
 * Hadoop 2.3 cdh5
http://archive.cloudera.com/cdh5/cdh/5/hadoop-2.3.0-cdh5.0.0.tar.gz
 * Spark 1.1 for Hadoop 2.3
http://d3kbcqa49mib13.cloudfront.net/spark-1.1.0-bin-hadoop2.3.tgz

pom.xml snippets: https://gist.github.com/ypwais/ff188611d4806aa05ed9

[1] http://stackoverflow.com/questions/24747037/how-to-define-a-dependency-scope-in-maven-to-include-a-library-in-compile-run

Thanks everybody!!
-Paul




Re: Spark 1.1 / cdh4 stuck using old hadoop client?

Posted by Sean Owen <so...@cloudera.com>.
From the caller / application perspective, you don't care what version
of Hadoop Spark is running on in the cluster. The Spark API you
compile against is the same. When you spark-submit the app, at
runtime, Spark is using the Hadoop libraries from the cluster, which
are the right version.

So when you build your app, you mark Spark as a 'provided' dependency.
Therefore in general, no, you do not build Spark for yourself if you
are a Spark app creator.

(Of course, your app would care if it were also using Hadoop libraries
directly. In that case, you will want to depend on hadoop-client, and
the right version for your cluster, but still mark it as provided.)

The version Spark is built against only matters when you are deploying
Spark's artifacts on the cluster to set it up.

Your error suggests there is still a version mismatch. Either you
deployed a build that was not compatible, or maybe you are packaging
a version of Spark with your app which is incompatible and
interfering.

For example, the artifacts you get via Maven depend on Hadoop 1.0.4. I
suspect that's what you're doing -- packaging Spark (plus Hadoop 1.0.4) with
your app, when it shouldn't be packaged.

Spark works out of the box with just about any modern combo of HDFS and YARN.
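
Concretely, deployment from the app side is just something like the
following (class name, jar, and master here are placeholders); the
cluster supplies matching Spark and Hadoop libraries at runtime:

    spark-submit \
      --class com.example.MyApp \
      --master yarn-cluster \
      my-app-assembly.jar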



Re: Spark 1.1 / cdh4 stuck using old hadoop client?

Posted by Christian Chua <cc...@icloud.com>.
Is 1.0.8 working for you?

You indicated your last known good version is 1.0.0.

Maybe we can track down where it broke.




Re: Spark 1.1 / cdh4 stuck using old hadoop client?

Posted by Paul Wais <pw...@yelp.com>.
Thanks Christian!  I tried compiling from source but am still getting the
same hadoop client version error when reading from HDFS.  Will have to poke
deeper... perhaps I've got some classpath issues.  FWIW I compiled using:

$ MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
mvn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests clean package

and hadoop 2.3 / cdh5 from
http://archive.cloudera.com/cdh5/cdh/5/hadoop-2.3.0-cdh5.0.0.tar.gz






Re: Spark 1.1 / cdh4 stuck using old hadoop client?

Posted by Christian Chua <cc...@icloud.com>.
Hi Paul.

I would recommend building your own 1.1.0 distribution.

	./make-distribution.sh --name hadoop-personal-build-2.4 --tgz -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests



I downloaded the "Pre-build for Hadoop 2.4" binary, and it had this strange behavior where

	spark-submit --master yarn-cluster ...

will work, but

	spark-submit --master yarn-client ...

will fail.


But with the personal build obtained from the command above, both work.


-Christian



