Posted to user@spark.apache.org by Gerard Maas <ge...@gmail.com> on 2016/02/05 12:58:26 UTC

Hadoop credentials missing in some tasks?

Hi,

We're facing a situation where simple queries to parquet files stored in
Swift through a Hive Metastore sometimes fail with this exception:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 6
in stage 58.0 failed 4 times, most recent failure: Lost task 6.3 in stage
58.0 (TID 412, agent-1.mesos.private):
org.apache.hadoop.fs.swift.exceptions.SwiftConfigurationException: Missing
mandatory configuration option: fs.swift.service.######.auth.url
at org.apache.hadoop.fs.swift.http.RestClientBindings.copy(RestClientBindings.java:219)
(...)

Queries requiring a full table scan, like "select count(*)", would fail with
the mentioned exception, while smaller chunks of work like "select *
from ... LIMIT 5" would succeed.

The problem seems to relate to the number of tasks scheduled:

If we force a reduction of the number of tasks to 1, the job succeeds:

dataframe.rdd.coalesce(1).count()

would return a correct result, while

dataframe.count()

would fail with the exception mentioned above.

To me, it looks like credentials are lost somewhere in the serialization
path when the tasks are submitted to the cluster. I have not yet found an
explanation for why a job that requires only one task succeeds.
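A sketch of one way to probe that hypothesis (not from the original post; "<service>" is a placeholder for the service name elided as ###### above):

```scala
// Diagnostic sketch (placeholder service name): a Configuration built fresh
// on an executor only sees what is in that JVM's core-site.xml, not keys set
// programmatically on the driver via sc.hadoopConfiguration.
import org.apache.hadoop.conf.Configuration

val key = "fs.swift.service.<service>.auth.url"
val perTask = sc.parallelize(1 to 8, 8).map { _ =>
  Option(new Configuration().get(key)).getOrElse("<missing>")
}.collect()
// Tasks reporting "<missing>" would suggest the key never reaches the executors.
```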

We are running Apache Zeppelin for Swift and Spark Notebook for S3.
Both show an equivalent exception within their specific Hadoop filesystem
implementation when the task fails:

Zeppelin + Swift:

org.apache.hadoop.fs.swift.exceptions.SwiftConfigurationException: Missing
mandatory configuration option: fs.swift.service.######.auth.url

Spark Notebook + S3:

java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key
must be specified as the username or password (respectively) of a s3n URL,
or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey
properties (respectively).
at org.apache.hadoop.fs.s3.S3Credentials.initialize(S3Credentials.java:70)

Valid credentials are being set programmatically through
sc.hadoopConfiguration.
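A minimal sketch of that kind of programmatic setup (the service name and all values below are placeholders, not the real credentials, which are elided in this post):

```scala
// Sketch only: "<service>" and the dummy values stand in for the real
// service name and credentials.
sc.hadoopConfiguration.set("fs.swift.service.<service>.auth.url",
  "https://auth.example.com/v2.0/tokens")
sc.hadoopConfiguration.set("fs.swift.service.<service>.username", "<user>")
sc.hadoopConfiguration.set("fs.swift.service.<service>.password", "<password>")

// The s3n equivalent on the other environment:
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "<access-key>")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "<secret-key>")
```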

Our system: Zeppelin or Spark Notebook with Spark 1.5.1 running on Docker,
Docker running on Mesos, Hadoop 2.4.0. One environment runs on Softlayer
(Swift) and the other on Amazon EC2 (S3); both are of similar size.

Any ideas on how to address this issue or figure out what's going on?

Thanks, Gerard.

Re: Hadoop credentials missing in some tasks?

Posted by Peter Vandenabeele <pe...@vandenabeele.com>.
On Fri, Feb 5, 2016 at 12:58 PM, Gerard Maas <ge...@gmail.com> wrote:

> Hi,
>
> We're facing a situation where simple queries to parquet files stored in
> Swift through a Hive Metastore sometimes fail with this exception:
>
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 6
> in stage 58.0 failed 4 times, most recent failure: Lost task 6.3 in stage
> 58.0 (TID 412, agent-1.mesos.private):
> org.apache.hadoop.fs.swift.exceptions.SwiftConfigurationException: Missing
> mandatory configuration option: fs.swift.service.######.auth.url
> at org.apache.hadoop.fs.swift.http.RestClientBindings.copy(RestClientBindings.java:219)
> (...)
>
> Queries requiring a full table scan, like select(count(*)) would fail with
> the mentioned exception while smaller chunks of work like " select *
>  from... LIMIT 5" would succeed.
>

...

An update:

When using the Zeppelin Notebook on a Mesos cluster, as a _workaround_ I
can get the notebook running reliably by using this setting and starting
with this paragraph:

* spark.mesos.coarse = true

import util.Random.nextInt
sc.parallelize((0 to 1000).toList, 20)
  .toDF.write.parquet(s"swift://###/test/${util.Random.nextInt}")

This parquet write will touch all the executors (4 worker nodes in this
experiment).

So, it looks like _writing_ once, at the start of the notebook, distributes
the Swift authentication data to the executors; after that, all queries
just work (including the count(*) queries that failed before).

This is using a Zeppelin notebook with Spark 1.5.1 and Hadoop 2.4.

HTH,

Peter