Posted to user@spark.apache.org by Alvaro Brandon <al...@gmail.com> on 2017/04/07 14:32:23 UTC

Does Spark use its own HDFS client?

I was going through SparkContext.textFile() and was wondering at what point
Spark communicates with HDFS. Since you also specify the Hadoop version when
you download the Spark binaries, I'm guessing it has its own client that
calls HDFS wherever you point it in the configuration files.

The goal is to instrument and log all the calls that Spark makes to HDFS.
Which class or classes perform these operations?
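For context, the "configuration files" part usually means fs.defaultFS in core-site.xml on the classpath (picked up via HADOOP_CONF_DIR); a minimal sketch, with a placeholder hostname:

```xml
<!-- core-site.xml: tells the Hadoop client embedded in Spark where HDFS lives.
     "namenode.example.com" is a placeholder for your NameNode host. -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode.example.com:8020</value>
  </property>
</configuration>
```

With this in place, sc.textFile("/path") and sc.textFile("hdfs://namenode.example.com:8020/path") resolve to the same filesystem.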

Re: Does Spark use its own HDFS client?

Posted by Steve Loughran <st...@hortonworks.com>.
On 7 Apr 2017, at 15:32, Alvaro Brandon <al...@gmail.com> wrote:

I was going through SparkContext.textFile() and was wondering at what point Spark communicates with HDFS. Since you also specify the Hadoop version when you download the Spark binaries, I'm guessing it has its own client that calls HDFS wherever you point it in the configuration files.



It uses the hadoop-hdfs JAR from the spark-assembly JAR, or from the lib dir under SPARK_HOME. Nobody would ever want to write their own HDFS client, not after looking at the parts of the code related to Kerberos. You could do it with webhdfs://, though that's not done here.


The goal is to instrument and log all the calls that Spark makes to HDFS. Which class or classes perform these operations?



org.apache.hadoop.hdfs.DistributedFileSystem

Take a look at HTrace here: https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/Tracing.html
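A lightweight alternative sketch: raise the log level for the HDFS client classes in Spark's conf/log4j.properties (Log4j 1.x syntax, as Spark shipped at the time). DFSClient is the class DistributedFileSystem delegates its RPCs to; exact logger output varies by Hadoop version:

```properties
# Log HDFS client operations at DEBUG; output lands in the driver/executor logs
log4j.logger.org.apache.hadoop.hdfs.DistributedFileSystem=DEBUG
log4j.logger.org.apache.hadoop.hdfs.DFSClient=DEBUG
# Optionally trace the wire-level RPC layer too (very verbose)
log4j.logger.org.apache.hadoop.ipc.Client=DEBUG
```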





Re: Does Spark use its own HDFS client?

Posted by Jörn Franke <jo...@gmail.com>.
Maybe using Ranger or Sentry would be a better choice to intercept those calls?

> On 7 Apr 2017, at 16:32, Alvaro Brandon <al...@gmail.com> wrote:
> 
> I was going through SparkContext.textFile() and was wondering at what point Spark communicates with HDFS. Since you also specify the Hadoop version when you download the Spark binaries, I'm guessing it has its own client that calls HDFS wherever you point it in the configuration files.
> 
> The goal is to instrument and log all the calls that Spark makes to HDFS. Which class or classes perform these operations?
> 
> 

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org