Posted to user@spark.apache.org by Paul Mackles <pa...@loopr.com> on 2017/11/17 23:15:29 UTC

History server and non-HDFS filesystems

Hi - I had originally posted this as a bug (SPARK-22528) but given my
uncertainty, it was suggested that I send it to the mailing list instead...

We are using Azure Data Lake (ADL) to store our event logs. This worked
fine in 2.1.x, but in 2.2.0 the underlying files are no longer visible to
the history server - even though we are using the same service principal
that was used to write the logs. I tracked it down to this call in
"FSHistoryProvider" (which was added for v2.2.0):


SparkHadoopUtil.checkAccessPermission()


From what I can tell, it is preemptively checking the permissions on the
files and skipping the ones which it thinks are not readable. The problem
is that it's using a check that appears to be specific to HDFS, so even
though the files are definitely readable, it skips over them. Also,
"FSHistoryProvider" is the only place this code is used.
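
To illustrate what I mean, here is a simplified model of that style of
check, written in Python rather than Spark's Scala. It is not the actual
Spark code: the names (FileStatus, check_access_permission) and the
user/owner values are hypothetical, and it assumes the check compares the
local process user against the owner and group that the filesystem reports,
POSIX style.

```python
from dataclasses import dataclass

READ = 4  # POSIX-style read bit


@dataclass
class FileStatus:
    """Hypothetical stand-in for the metadata a filesystem reports."""
    owner: str
    group: str
    mode: int  # POSIX-style permission bits, e.g. 0o640


def check_access_permission(status, user, groups, action=READ):
    """Owner/group/other check driven by the *local* user identity.

    Assumption: the caller's identity comes from the process context,
    not from the filesystem instance that produced `status`.
    """
    if user == status.owner:
        bits = (status.mode >> 6) & 7   # owner bits
    elif status.group in groups:
        bits = (status.mode >> 3) & 7   # group bits
    else:
        bits = status.mode & 7          # "other" bits
    return (bits & action) == action


# Hypothetical ADL-like case: the reported owner is a service principal
# objectId, while the history server runs as a local user named "spark".
log = FileStatus(owner="hypothetical-object-id",
                 group="hypothetical-group", mode=0o640)

skipped = not check_access_permission(log, user="spark", groups=["spark"])
```

In this model the file is skipped because the local user name does not
match the reported owner, even though the credentials actually used to
open the file could read it. Making the file world readable (0o644), or
running as a user whose name equals the reported owner, makes the same
check pass, which matches the two workarounds I found.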

I was able to work around it by either:

* setting the permissions for the files on ADL to world readable

* or setting HADOOP_PROXY to the objectId of the Azure service principal
which owns the files

Neither of these workarounds is acceptable for our environment. That said,
I am not sure how this should be addressed:

* Is this an issue with the Azure/Hadoop bindings not complying with the
Hadoop FileSystem interface/contract in some way?

* Is this an issue with "checkAccessPermission()" not really accounting for
all of the possible FileSystem implementations?

My gut tells me it's the latter, because SparkHadoopUtil.checkAccessPermission()
gets its "currentUser" info from outside of the FileSystem class, and it
doesn't make sense to me that an instance of FileSystem would affect a
global context, since there could be many FileSystem instances in a given
app.

That said, I know ADL is not heavily used at this time, so I wonder if
anyone is seeing this with S3 as well? Maybe not, since S3 permissions are
always reported as world-readable (I think), which causes
checkAccessPermission() to succeed.

Any thoughts or feedback appreciated.

-- 
Thanks,
Paul