Posted to issues@spark.apache.org by "Cheng Lian (JIRA)" <ji...@apache.org> on 2015/08/19 11:19:46 UTC
[jira] [Resolved] (SPARK-9600) DataFrameWriter.saveAsTable always writes data to "/user/hive/warehouse"
[ https://issues.apache.org/jira/browse/SPARK-9600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Cheng Lian resolved SPARK-9600.
-------------------------------
Resolution: Not A Problem
> DataFrameWriter.saveAsTable always writes data to "/user/hive/warehouse"
> ------------------------------------------------------------------------
>
> Key: SPARK-9600
> URL: https://issues.apache.org/jira/browse/SPARK-9600
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.4.1, 1.5.0
> Reporter: Cheng Lian
> Assignee: Sudhakar Thota
> Priority: Blocker
> Attachments: SPARK-9600-fl1.txt
>
>
> Get a clean Spark 1.4.1 build:
> {noformat}
> $ git checkout v1.4.1
> $ ./build/sbt -Phive -Phive-thriftserver -Phadoop-1 -Dhadoop.version=1.2.1 clean assembly/assembly
> {noformat}
> Stop any running local Hadoop instance and unset all Hadoop environment variables, so that Spark is forced to run against the local file system only:
> {noformat}
> $ unset HADOOP_CONF_DIR
> $ unset HADOOP_PREFIX
> $ unset HADOOP_LIBEXEC_DIR
> $ unset HADOOP_CLASSPATH
> {noformat}
> In this way we also ensure that the default Hive warehouse location points to the local file system path {{file:///user/hive/warehouse}}. Now we create warehouse directories for testing:
> {noformat}
> $ sudo rm -rf /user # !! WARNING: IT'S /user RATHER THAN /usr !!
> $ sudo mkdir -p /user/hive/{warehouse,warehouse_hive13}
> $ sudo chown -R lian:staff /user
> $ tree /user
> /user
> └── hive
> ├── warehouse
> └── warehouse_hive13
> {noformat}
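The same layout can be reproduced without sudo by using a throwaway root directory, a sketch for readers who want to follow along in a sandbox (the temp-root approach is my addition; the report itself deliberately uses {{/user/hive/...}}):

```python
import os
import tempfile

# Sketch only: recreate the directory layout above under a throwaway
# root instead of /user, so no sudo is required.
root = tempfile.mkdtemp()
for d in ("warehouse", "warehouse_hive13"):
    os.makedirs(os.path.join(root, "hive", d))

# Mirrors the `tree /user` output above.
print(sorted(os.listdir(os.path.join(root, "hive"))))
```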
> Create a minimal {{hive-site.xml}} that only overrides the warehouse location, and put it under {{$SPARK_HOME/conf}}:
> {noformat}
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
> <configuration>
> <property>
> <name>hive.metastore.warehouse.dir</name>
> <value>file:///user/hive/warehouse_hive13</value>
> </property>
> </configuration>
> {noformat}
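As a quick sanity check that the file says what we think it does, a standalone sketch (plain Python, with the XML inlined so it is self-contained) that extracts the overridden property:

```python
import xml.etree.ElementTree as ET

# The minimal hive-site.xml from above, inlined for a self-contained check.
HIVE_SITE = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>file:///user/hive/warehouse_hive13</value>
  </property>
</configuration>"""

def warehouse_dir(xml_text):
    """Return the configured hive.metastore.warehouse.dir, or None if absent."""
    root = ET.fromstring(xml_text)
    for prop in root.findall("property"):
        if prop.findtext("name") == "hive.metastore.warehouse.dir":
            return prop.findtext("value")
    return None

print(warehouse_dir(HIVE_SITE))
```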
> Now run our test snippets with {{pyspark}}:
> {noformat}
> $ ./bin/pyspark
> In [1]: sqlContext.range(10).coalesce(1).write.saveAsTable("ds")
> {noformat}
> Check warehouse directories:
> {noformat}
> $ tree /user
> /user
> └── hive
> ├── warehouse
> │ └── ds
> │ ├── _SUCCESS
> │ ├── _common_metadata
> │ ├── _metadata
> │ └── part-r-00000-46e4b32a-5c4d-4dba-b8d6-8d30ae910dc9.gz.parquet
> └── warehouse_hive13
> └── ds
> {noformat}
> Here you may notice the weird part: {{ds}} appears under both {{warehouse}} and {{warehouse_hive13}}, but the data are only written into the former.
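One possible workaround (an assumption on my part, not something verified in this report) is to pin the table's data location explicitly via the {{path}} writer option instead of relying on the default warehouse resolution; the table name {{ds_explicit}} is hypothetical:

```python
# Hedged sketch: assumes a running pyspark shell with sqlContext, as above.
# Setting "path" explicitly should bypass the default warehouse location.
sqlContext.range(10).coalesce(1).write \
    .option("path", "file:///user/hive/warehouse_hive13/ds_explicit") \
    .saveAsTable("ds_explicit")
```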
> Now let's try HiveQL:
> {noformat}
> In [2]: sqlContext.range(10).coalesce(1).registerTempTable("t")
> In [3]: sqlContext.sql("CREATE TABLE ds_ctas AS SELECT * FROM t")
> {noformat}
> Check the directories again:
> {noformat}
> $ tree /user
> /user
> └── hive
> ├── warehouse
> │ └── ds
> │ ├── _SUCCESS
> │ ├── _common_metadata
> │ ├── _metadata
> │ └── part-r-00000-46e4b32a-5c4d-4dba-b8d6-8d30ae910dc9.gz.parquet
> └── warehouse_hive13
> ├── ds
> └── ds_ctas
> ├── _SUCCESS
> └── part-00000
> {noformat}
> So HiveQL works fine. (Hive never writes Parquet summary files, which is why {{_common_metadata}} and {{_metadata}} are missing under {{ds_ctas}}.)
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)