Posted to user@spark.apache.org by chinchu <ch...@gmail.com> on 2014/09/19 07:53:36 UTC

spark-submit command-line with --files

Hi,

I am running spark-1.1.0 and I want to pass a file (containing Java-serialized
objects used to initialize my program) to the app's main program. I
am using the --files option, but I am not able to retrieve the file in the
main class: it reports a NullPointerException. [I tried both local &
yarn-cluster with the same result.] I am using
SparkFiles.get("myobject.ser") to get the file. Am I doing something wrong?

CMD:
bin/spark-submit --name Test --class com.test.batch.modeltrainer.ModelTrainerMain \
  --master local --files /tmp/myobject.ser --verbose \
  /opt/test/lib/spark-test.jar

com.test.batch.modeltrainer.ModelTrainerMain.scala
37: val serFile = SparkFiles.get("myobject.ser")

Exception:
Exception in thread "main" java.lang.NullPointerException
  at org.apache.spark.SparkFiles$.getRootDirectory(SparkFiles.scala:37)
  at org.apache.spark.SparkFiles$.get(SparkFiles.scala:31)
  at
com.test.batch.modeltrainer.ModelTrainerMain$.main(ModelTrainerMain.scala:37)
  at
com.test.batch.modeltrainer.ModelTrainerMain.main(ModelTrainerMain.scala)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
  at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:606)
  at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:303)
  at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
  at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Looking at the Scala code for SparkFiles:37, it looks like SparkEnv.get is
returning null.
Thanks



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-submit-command-line-with-files-tp14645.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: spark-submit command-line with --files

Posted by chinchu <ch...@gmail.com>.
Thanks Marcelo. The code trying to read the file always runs in the driver. I
understand the problem with other master deployments, but will it work in
local, yarn-client & yarn-cluster deployments? That's all I care about for now
:-)

Also, what is the suggested way to do something like this? Put the file on
HDFS?

-C






Re: spark-submit command-line with --files

Posted by Marcelo Vanzin <va...@cloudera.com>.
Hi chinchu,

Where does the code trying to read the file run? Is it running on the
driver or on some executor?

If it's running on the driver, in yarn-cluster mode, the file should
have been copied to the application's work directory before the driver
is started, so simply doing "new FileInputStream(foo)" should work.

That does make some assumptions about the code being run in
yarn-cluster mode, though, and it may not work with a different master
deployment. I'm not sure, without looking further, what the expected
semantics are for reading these files from code not running in
the executors.
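A minimal sketch of what I mean (assuming yarn-cluster mode, where YARN localizes each --files entry into the container's working directory under its base name; readArgsMap and the Map payload type are illustrative, not from the original code):

```scala
import java.io.{FileInputStream, ObjectInputStream}

// Read a Java-serialized Map from a file in the current working
// directory. In yarn-cluster mode, a file passed with
// "--files /tmp/myobject.ser" is localized next to the driver, so it
// can be opened by its base name, "myobject.ser".
def readArgsMap(fileName: String): Map[String, String] = {
  val in = new ObjectInputStream(new FileInputStream(fileName))
  try in.readObject().asInstanceOf[Map[String, String]]
  finally in.close()
}

// val argsMap = readArgsMap("myobject.ser")  // base name, not /tmp/myobject.ser
```

Note that the absolute path from the submitting machine is irrelevant here: only the base name survives the localization step.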


On Sat, Sep 20, 2014 at 1:14 AM, chinchu <ch...@gmail.com> wrote:
> Thanks Andrew.
>
> I understand the problem a little better now. There was a typo in my earlier
> mail & a bug in the code (causing the NPE in SparkFiles). I am using the
> --master yarn-cluster (not local). And in this mode, the
> com.test.batch.modeltrainer.ModelTrainerMain - my main-class will run on the
> application master in yarn (3-node cluster) & the serialized file is on my
> laptop:/tmp/myobject.ser. That is the reason I was using SparkFiles.get() to
> get this file (and not just doing a new File("/tmp/myobject.ser"))
>
> 37: val serFile = SparkFiles.get("myobject.ser")
> 38: val argsMap =  deSerializeMapFromFile(serFile)
>
> But this gets me a FileNotFoundException:
> /tmp/spark-3292c9e3-db06-43b1-89f1-423f40e8e84b/myobject.ser in
> deSerializeMapFromFile(xxx). This runs in the Spark driver and not the
> executor, correct? And that's why it's probably not finding the file.
>
> Here's what I am trying to do:
> my-laptop (has /tmp/myobject.ser & /opt/test/lib/spark-test.jar)
> launches spark-submit --files .. ----> hadoop-yarn-cluster[3 nodes]
>
> and on my laptop, $HADOOP_CONF_DIR has the configuration that points to
> this 3-node YARN cluster.
>
> What is the right way to get to this file (myobject.ser) in my main-class
> (when running in the Spark driver on YARN & not in the executor)?
>
> Thanks again
> -C
>
> PS: java.io.FileNotFoundException:
> /tmp/spark-3292c9e3-db06-43b1-89f1-423f40e8e84b/myobject.ser (No such file
> or directory)
>   at java.io.FileInputStream.open(Native Method)
>   at java.io.FileInputStream.<init>(FileInputStream.java:146)
>   at java.io.FileInputStream.<init>(FileInputStream.java:101)
>   at
> com.test.batch.modeltrainer.ModelTrainerMain$.deSerializeMapFromFile(ModelTrainerMain.scala:96)
>
>
>
>



-- 
Marcelo



Re: spark-submit command-line with --files

Posted by chinchu <ch...@gmail.com>.
Thanks Andrew.

I understand the problem a little better now. There was a typo in my earlier
mail & a bug in the code (causing the NPE in SparkFiles). I am using the
--master yarn-cluster (not local). And in this mode, the
com.test.batch.modeltrainer.ModelTrainerMain - my main-class will run on the
application master in yarn (3-node cluster) & the serialized file is on my
laptop:/tmp/myobject.ser. That is the reason I was using SparkFiles.get() to
get this file (and not just doing a new File("/tmp/myobject.ser"))

37: val serFile = SparkFiles.get("myobject.ser")
38: val argsMap =  deSerializeMapFromFile(serFile)

But this gets me a FileNotFoundException:
/tmp/spark-3292c9e3-db06-43b1-89f1-423f40e8e84b/myobject.ser in
deSerializeMapFromFile(xxx). This runs in the Spark driver and not the
executor, correct? And that's why it's probably not finding the file.

Here's what I am trying to do:
my-laptop (has /tmp/myobject.ser & /opt/test/lib/spark-test.jar)
launches spark-submit --files .. ----> hadoop-yarn-cluster[3 nodes]

and on my laptop, $HADOOP_CONF_DIR has the configuration that points to
this 3-node YARN cluster.

What is the right way to get to this file (myobject.ser) in my main-class
(when running in the Spark driver on YARN & not in the executor)?

Thanks again
-C

PS: java.io.FileNotFoundException:
/tmp/spark-3292c9e3-db06-43b1-89f1-423f40e8e84b/myobject.ser (No such file
or directory)
  at java.io.FileInputStream.open(Native Method)
  at java.io.FileInputStream.<init>(FileInputStream.java:146)
  at java.io.FileInputStream.<init>(FileInputStream.java:101)
  at
com.test.batch.modeltrainer.ModelTrainerMain$.deSerializeMapFromFile(ModelTrainerMain.scala:96)






Re: spark-submit command-line with --files

Posted by chinchu <ch...@gmail.com>.
Thanks Andrew, that helps.

On Fri, Sep 19, 2014 at 5:47 PM, Andrew Or-2 [via Apache Spark User List] <
ml-node+s1001560n14708h84@n3.nabble.com> wrote:

> Hey just a minor clarification, you _can_ use SparkFiles.get in your
> application only if it runs on the executors, e.g. in the following way:
>
> sc.parallelize(1 to 100).map { i => SparkFiles.get("my.file") }.collect()
>
> But not in general (otherwise NPE, as in your case). Perhaps this should
> be documented more clearly. Thanks to Marcelo for pointing this out.
>
>





Re: spark-submit command-line with --files

Posted by Andrew Or <an...@databricks.com>.
Hey just a minor clarification, you _can_ use SparkFiles.get in your
application only if it runs on the executors, e.g. in the following way:

sc.parallelize(1 to 100).map { i => SparkFiles.get("my.file") }.collect()

But not in general (otherwise NPE, as in your case). Perhaps this should be
documented more clearly. Thanks to Marcelo for pointing this out.

Re: spark-submit command-line with --files

Posted by Andrew Or <an...@databricks.com>.
Hi Chinchu,

SparkEnv is an internal class that is only meant to be used within Spark.
Outside of Spark, it will be null because there are no executors or driver
to start an environment for. Similarly, SparkFiles is meant to be used
internally (though its privacy settings should be modified to reflect
that). Is there a reason why you need to pass the serialized objects this
way? Can't you access the deserialized form from your application?
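For small initialization data, one possible pattern (just a sketch; encodeArg/decodeArg are invented helper names, not Spark API) is to skip file distribution entirely: serialize the map to a Base64 string and pass it as an ordinary program argument, which reaches main(args) the same way in local, yarn-client, and yarn-cluster mode.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
import java.util.Base64

// Encode a Java-serializable value into a Base64 string that can be
// appended to the spark-submit command line as an application argument.
def encodeArg(value: AnyRef): String = {
  val bytes = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(bytes)
  out.writeObject(value)
  out.close()
  Base64.getEncoder.encodeToString(bytes.toByteArray)
}

// Decode it back inside the driver's main(args). No SparkFiles or
// SparkEnv involved, so there is nothing to NPE on before the
// SparkContext exists.
def decodeArg(arg: String): AnyRef = {
  val in = new ObjectInputStream(
    new ByteArrayInputStream(Base64.getDecoder.decode(arg)))
  try in.readObject() finally in.close()
}
```

This only makes sense while the payload stays small enough for a command line; for anything larger, putting the file on HDFS and reading it from the driver would be the more usual route.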

Andrew

2014-09-18 22:53 GMT-07:00 chinchu <ch...@gmail.com>:

> Hi,
>
> I am running spark-1.1.0 and I want to pass a file (containing Java-serialized
> objects used to initialize my program) to the app's main program. I
> am using the --files option, but I am not able to retrieve the file in the
> main class: it reports a NullPointerException. [I tried both local &
> yarn-cluster with the same result.] I am using
> SparkFiles.get("myobject.ser") to get the file. Am I doing something wrong?
>
> CMD:
> bin/spark-submit --name Test --class com.test.batch.modeltrainer.ModelTrainerMain \
>   --master local --files /tmp/myobject.ser --verbose \
>   /opt/test/lib/spark-test.jar
>
> com.test.batch.modeltrainer.ModelTrainerMain.scala
> 37: val serFile = SparkFiles.get("myobject.ser")
>
> Exception:
> Exception in thread "main" java.lang.NullPointerException
>   at org.apache.spark.SparkFiles$.getRootDirectory(SparkFiles.scala:37)
>   at org.apache.spark.SparkFiles$.get(SparkFiles.scala:31)
>   at
>
> com.test.batch.modeltrainer.ModelTrainerMain$.main(ModelTrainerMain.scala:37)
>   at
> com.test.batch.modeltrainer.ModelTrainerMain.main(ModelTrainerMain.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at
>
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at
>
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:303)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>
> Looking at the Scala code for SparkFiles:37, it looks like SparkEnv.get is
> returning null.
> Thanks
>
>
>