Posted to user@spark.apache.org by 马阳阳 <ma...@163.com> on 2020/05/11 09:37:29 UTC

Spark wrote to Hive table: file content format and file format in metadata don't match

Hi,
We are currently trying to replace Hive with the Spark Thrift Server.
We encountered a problem with the following SQL:
    create table test_db.test_sink as select [some columns] from test_db.test_source
After the SQL ran successfully, we queried data from test_db.test_sink, and the data was
gibberish. After some inspection, we found that test_db.test_sink has ORC files on HDFS
(which can be read with spark.read.orc), but the file format recorded in the metadata is
text. When reading with spark.read.orc().show, the output column names are not the column
names from test_db.test_source, but something like:
|_col0|   _col1|   _col2|               _col3|     _col4|            _col5|_col6|_col7|    _col8|_col9|    _col10|    _col11|_col12|
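For reference, this is roughly how we inspected the mismatch from spark-shell (the
warehouse path below is illustrative; our actual table location may differ):

    // Illustrative HDFS location of the table; the real warehouse path may differ.
    val tablePath = "hdfs:///user/hive/warehouse/test_db.db/test_sink"

    // Reading the files directly as ORC works, but the columns come back as
    // _col0, _col1, ... instead of the names from test_db.test_source.
    spark.read.orc(tablePath).show()

    // The metastore, however, reports the table's file format as text.
    spark.sql("DESCRIBE FORMATTED test_db.test_sink").show(100, truncate = false)

    // Querying through the table therefore decodes the ORC bytes as if they were
    // delimited text, which is why the result looks like gibberish.
    spark.sql("SELECT * FROM test_db.test_sink LIMIT 10").show()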

What is mysterious is that after rerunning the SQL, without any changes, the table is
alright (the file content and the file format in the metadata match).
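One workaround we are considering, though we have not verified that it avoids the
problem, is to declare the file format explicitly in the CTAS so the written files and
the metastore entry cannot disagree, e.g. from spark-shell:

    // Sketch only: state the storage format explicitly instead of relying on the
    // default CTAS file format. SELECT * stands in for our actual column list.
    spark.sql("""
      CREATE TABLE test_db.test_sink
      STORED AS ORC
      AS SELECT * FROM test_db.test_source
    """)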

I wonder if anyone has encountered the same problem.

I would appreciate any response.