Posted to dev@spark.apache.org by "assaf.mendelson" <as...@rsa.com> on 2017/01/16 10:35:18 UTC

spark support on windows

Hi,
The documentation says Spark is supported on Windows.
The problem, however, is that the documentation for Windows is lacking. There are sources (such as https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-tips-and-tricks-running-spark-windows.html and many more) which explain how to make Spark run on Windows, but they all involve downloading a third-party winutils.exe file.
Since this file is downloaded from a repository belonging to a private person, this can be an issue (e.g. getting approval to install it on a company computer can be difficult).
There are tons of JIRA tickets on the subject (most marked as duplicate or not a problem), but I believe that if we say Spark is supported on Windows, there should be a clear explanation of how to run it, and one shouldn't have to use an executable from a private person.

If indeed winutils.exe is the correct solution, I believe it should be bundled with the Spark binary distribution, along with clear instructions on how to add it.
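For reference, the setup those guides describe typically boils down to something like the following sketch (paths and locations are illustrative only; the winutils.exe should match the Hadoop version the Spark build was compiled against):

  :: illustrative example -- adjust paths to your environment
  mkdir C:\hadoop\bin
  copy winutils.exe C:\hadoop\bin\
  set HADOOP_HOME=C:\hadoop
  set PATH=%PATH%;%HADOOP_HOME%\bin
  :: many guides also create and relax permissions on the Hive scratch directory
  mkdir C:\tmp\hive
  C:\hadoop\bin\winutils.exe chmod 777 C:\tmp\hive
  :: then start Spark as usual
  %SPARK_HOME%\bin\spark-shell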
Assaf.

Re: spark support on windows

Posted by Steve Loughran <st...@hortonworks.com>.
On 16 Jan 2017, at 10:35, assaf.mendelson <as...@rsa.com> wrote:

Hi,
The documentation says Spark is supported on Windows.
The problem, however, is that the documentation for Windows is lacking. There are sources (such as https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-tips-and-tricks-running-spark-windows.html and many more) which explain how to make Spark run on Windows, but they all involve downloading a third-party winutils.exe file.
Since this file is downloaded from a repository belonging to a private person,

A repository belonging to me, stevel@apache.org

this can be an issue (e.g. getting approval to install it on a company computer can be difficult).


And, as they come from a committer on the Hadoop PMC, those signed artifacts are no less trustworthy than anything you get from the ASF itself. They are clean builds off a Windows VM that is only ever used for building and testing Hadoop code, nothing else; the VM is powered off for most of its life. That actually makes it less of a security risk than the main desktop. And you can check the GPG signature of the artifacts to see they've not been tampered with.
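For example, verifying a download against its detached signature would look roughly like this (the key ID and file names are placeholders; the real values depend on the artifact and the published signing key):

  :: placeholders only -- substitute the actual signing key ID and file names
  gpg --recv-keys <signing-key-id>
  gpg --verify winutils.exe.asc winutils.exe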

There are tons of JIRA tickets on the subject (most marked as duplicate or not a problem), but I believe that if we say Spark is supported on Windows, there should be a clear explanation of how to run it, and one shouldn't have to use an executable from a private person.

While I recognise your concerns, if I wanted to run code on your machines, rest assured, I wouldn't do it in such an obvious way.

I'd do it via a transitive Maven artifact with a harmless name like "org.example.xml-unit-diags", which would do something useful except in the special case that it's running on code in your subnet; get a patch into a pom.xml to pull it into org.apache.hadoop somewhere, release a version of Hadoop with that dependency, then wait for it to propagate downstream into everything, including all those server farms running Linux only.

Writing a malicious Windows native executable would require me to write C/C++ Windows code, and I don't want to go there.

Of course, if I did any of these things I'd be in trouble when caught: lose my job, never be trusted to submit a line of code to any OSS project again, lose all my friends, etc. I have nothing to gain by doing so.

If you really don't trust me, the instructions for building it are online: set up a Windows system for compiling Hadoop, check out the branch, and then run

  mvn -T 1C package -Pdist -Dmaven.javadoc.skip=true -DskipTests

Or go to hortonworks.com, download the Windows version and lift the Windows binaries. Same thing, built by a colleague-managed release VM.


If indeed winutils.exe is the correct solution, I believe it should be bundled with the Spark binary distribution, along with clear instructions on how to add it.
I recognise that it is good to question the provenance of every line of code executed on machines you care about. I am reasonably confident as to the quality of this code; given that it was a checkout and build of the ASF tagged release, then signed by me, it would need either my VM to be corrupted, my VM's feed from the ASF HTTPS repo to be subverted by a fake SSL cert, or someone to get hold of my GPG and GitHub keys and upload something malicious in my name. Interestingly, that is a vulnerability, one I covered last year in my "Household infosec in a post-Sony era" talk: https://www.youtube.com/watch?v=tcRjG1CCrPs

You'll be pleased to know that the relevant keys now live on a YubiKey, so even malicious code executed on my desktop cannot get the secrets off the (encrypted) local drive. It would need physical access to the key, and I'd notice it was missing, revoke everything, etc., making the risk of my keys being stolen low. That leaves the general problem of "our entire build process is based on the assumption that we trust the Maven repositories and the people who wrote the JARs".

That's a far more serious problem than the provenance of a single .exe file on GitHub.

-Steve

Re: spark support on windows

Posted by Steve Loughran <st...@hortonworks.com>.
On 16 Jan 2017, at 11:06, Hyukjin Kwon <gu...@gmail.com> wrote:

Hi,

I just looked through Jacek's page and I believe that is the correct way.

That seems to be an issue specific to the Hadoop libraries[1]. To my knowledge, winutils and the binaries in the private repo are built by a Hadoop PMC member on a dedicated Windows VM, which I believe makes them pretty trustworthy.

thank you :)

I also check out and build the specific git commit SHA1 of the release, not any (movable) tag, so my builds are made from sources identical to the matching releases.
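A rough sketch of that workflow, using the same mvn invocation as above (the SHA1 is a placeholder; see Hadoop's BUILDING.txt for the native toolchain prerequisites on Windows):

  :: placeholder SHA1 -- use the commit of the release being reproduced
  git clone https://github.com/apache/hadoop.git
  cd hadoop
  git checkout <release-commit-sha1>
  :: run from a command prompt with the Windows native build toolchain set up
  mvn -T 1C package -Pdist -Dmaven.javadoc.skip=true -DskipTests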

This can be compiled from source. If you think it is not reliable or not safe, you can go and build it yourself.

I agree it would be great if there were documentation about this, as we make only a weak promise for Windows[2] and I believe installing Spark on Windows always requires some overhead. FWIW, in the case of SparkR, there is some documentation [3].

As for bundling it, it seems even Hadoop itself does not include this in its releases. I think documentation would be enough.

Really, Hadoop itself should be releasing the Windows binaries. It's just that it complicates the release process: the Linux build/test/release would have to be done, then somehow the Windows artifacts would need to be built on another machine and mixed in. That's the real barrier: extra work. That said, maybe it's time.




As for the many JIRAs, I am at least resolving them one by one.

I hope my answer is helpful and makes sense.

Thanks.


[1] https://wiki.apache.org/hadoop/WindowsProblems
[2] https://github.com/apache/spark/blob/f3a3fed76cb74ecd0f46031f337576ce60f54fb2/docs/index.md
[3] https://github.com/apache/spark/blob/master/R/WINDOWS.md





Re: spark support on windows

Posted by Hyukjin Kwon <gu...@gmail.com>.
Hi,

I just looked through Jacek's page and I believe that is the correct way.

That seems to be an issue specific to the Hadoop libraries[1]. To my knowledge, winutils and the binaries in the private repo are built by a Hadoop PMC member on a dedicated Windows VM, which I believe makes them pretty trustworthy.
This can be compiled from source. If you think it is not reliable or not safe, you can go and build it yourself.

I agree it would be great if there were documentation about this, as we make only a weak promise for Windows[2] and I believe installing Spark on Windows always requires some overhead. FWIW, in the case of SparkR, there is some documentation [3].

As for bundling it, it seems even Hadoop itself does not include this in its releases. I think documentation would be enough.

As for the many JIRAs, I am at least resolving them one by one.

I hope my answer is helpful and makes sense.

Thanks.


[1] https://wiki.apache.org/hadoop/WindowsProblems
[2] https://github.com/apache/spark/blob/f3a3fed76cb74ecd0f46031f337576ce60f54fb2/docs/index.md
[3] https://github.com/apache/spark/blob/master/R/WINDOWS.md

