Posted to user@spark.apache.org by Koert Kuipers <ko...@tresata.com> on 2016/10/06 04:18:24 UTC

spark 2.0.1 upgrade breaks on WAREHOUSE_PATH

i just replaced our spark 2.0.0 install on the yarn cluster with spark 2.0.1
and copied over the configs.

to give it a quick test i started spark-shell and created a dataset. i get
this:

16/10/05 23:55:13 WARN spark.SparkContext: Use an existing SparkContext,
some configuration may not take effect.
Spark context Web UI available at http://***:4040
Spark context available as 'sc' (master = yarn, app id =
application_1471212701720_1580).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.0.1
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java
1.7.0_75)
Type in expressions to have them evaluated.
Type :help for more information.

scala> import spark.implicits._
import spark.implicits._

scala> val x = List(1,2,3).toDS
org.apache.spark.SparkException: Unable to create database default as
failed to create its directory hdfs://dev/home/koert/spark-warehouse
  at org.apache.spark.sql.catalyst.catalog.InMemoryCatalog.liftedTree1$1(InMemoryCatalog.scala:114)
  at org.apache.spark.sql.catalyst.catalog.InMemoryCatalog.createDatabase(InMemoryCatalog.scala:108)
  at org.apache.spark.sql.catalyst.catalog.SessionCatalog.createDatabase(SessionCatalog.scala:147)
  at org.apache.spark.sql.catalyst.catalog.SessionCatalog.<init>(SessionCatalog.scala:89)
  at org.apache.spark.sql.internal.SessionState.catalog$lzycompute(SessionState.scala:95)
  at org.apache.spark.sql.internal.SessionState.catalog(SessionState.scala:95)
  at org.apache.spark.sql.internal.SessionState$$anon$1.<init>(SessionState.scala:112)
  at org.apache.spark.sql.internal.SessionState.analyzer$lzycompute(SessionState.scala:112)
  at org.apache.spark.sql.internal.SessionState.analyzer(SessionState.scala:111)
  at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
  at org.apache.spark.sql.Dataset.<init>(Dataset.scala:161)
  at org.apache.spark.sql.Dataset.<init>(Dataset.scala:167)
  at org.apache.spark.sql.Dataset$.apply(Dataset.scala:59)
  at org.apache.spark.sql.SparkSession.createDataset(SparkSession.scala:423)
  at org.apache.spark.sql.SQLContext.createDataset(SQLContext.scala:380)
  at org.apache.spark.sql.SQLImplicits.localSeqToDatasetHolder(SQLImplicits.scala:171)
  ... 50 elided

this did not happen in spark 2.0.0.
the location it is trying to access makes little sense: it is going to hdfs,
but then it is looking for my local home directory (/home/koert exists
locally but not on hdfs).

i suspect the issue is SPARK-15899, but i am not sure. in the pullreq for
that, WAREHOUSE_PATH got changed:
   val WAREHOUSE_PATH = SQLConfigBuilder("spark.sql.warehouse.dir")
     .doc("The default location for managed databases and tables.")
     .stringConf
 -    .createWithDefault("file:${system:user.dir}/spark-warehouse")
 +    .createWithDefault("${system:user.dir}/spark-warehouse")

notice how the file: scheme got removed from the url, causing spark to look
on hdfs now, since that is my default filesystem on the cluster. but
system:user.dir is still my local home directory. when you combine the two,
you get a path that doesn't exist.
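
to illustrate what i think happens (just a sketch, values made up to match the
error above, not the actual spark code): when a path without a scheme gets
qualified, it picks up the scheme and authority of the default filesystem.
e.g. pasted into spark-shell:

  import java.net.URI
  import org.apache.hadoop.fs.Path

  // fs.defaultFS and working dir made up to match my cluster
  val defaultFs  = new URI("hdfs://dev")
  val workingDir = new Path("hdfs://dev/user/koert")

  // ${system:user.dir}/spark-warehouse without a scheme, so it gets
  // qualified against the default filesystem
  val unqualified = new Path("/home/koert/spark-warehouse")
  println(unqualified.makeQualified(defaultFs, workingDir))
  // hdfs://dev/home/koert/spark-warehouse, the directory from the exception

with the old file: prefix the path already carried a scheme, so it stayed on
the local filesystem.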

Re: spark 2.0.1 upgrade breaks on WAREHOUSE_PATH

Posted by Koert Kuipers <ko...@tresata.com>.
if the intention is to create this on the default hadoop filesystem (and
not local), then maybe we can use FileSystem.getHomeDirectory()? it should
return the correct home directory on the relevant FileSystem (local or
hdfs).

if the intention is to create this only locally, then why bother using
hadoop filesystem api at all?
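
something like this, just a sketch of the idea (not a patch), run in
spark-shell with the cluster config on the classpath:

  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.{FileSystem, Path}

  // home directory of whatever the default filesystem is:
  // something like hdfs://.../user/koert on the cluster, file:/home/koert locally
  val fs = FileSystem.get(new Configuration())
  val warehouseDir = new Path(fs.getHomeDirectory, "spark-warehouse")
  println(warehouseDir)

that way the default location actually exists on the filesystem it points at.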


Re: spark 2.0.1 upgrade breaks on WAREHOUSE_PATH

Posted by Koert Kuipers <ko...@tresata.com>.
well it seems to work if i set spark.sql.warehouse.dir to
/tmp/spark-warehouse in spark-defaults, and it creates it on hdfs.

however can this directory safely be shared between multiple users running
jobs?

if not, then i need to set this per user (instead of a single setting in
spark-defaults), which means i need to change the jobs, which makes an
upgrade of a production cluster running many jobs more difficult.

or can i create a setting in spark-defaults that includes a reference to
the user? something like /tmp/{user}/spark-warehouse?
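
for reference, the per-job form of the override, if it comes to that, would be
something like this (path just an example; the same value can also be passed
on the command line with --conf spark.sql.warehouse.dir=...):

  import org.apache.spark.sql.SparkSession

  // has to be set before the session is created, so it goes on the builder
  val spark = SparkSession.builder()
    .appName("my-job")
    .config("spark.sql.warehouse.dir", "hdfs:///tmp/koert/spark-warehouse")
    .getOrCreate()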




Re: spark 2.0.1 upgrade breaks on WAREHOUSE_PATH

Posted by Sean Owen <so...@cloudera.com>.
Yeah I see the same thing. You can fix this by setting
spark.sql.warehouse.dir of course as a workaround. I restarted a
conversation about it at
https://github.com/apache/spark/pull/13868#pullrequestreview-3081020

I think the question is whether spark-warehouse is always supposed to be a
local dir, or whether it could be an HDFS dir. A change is needed either way;
I just want to clarify what it is.
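
For example, a single line in conf/spark-defaults.conf is enough as a
workaround (the directory is just an example):

  spark.sql.warehouse.dir  hdfs:///tmp/spark-warehouse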
