Posted to issues@spark.apache.org by "Suchintak Patnaik (Jira)" <ji...@apache.org> on 2019/09/24 18:59:00 UTC

[jira] [Updated] (SPARK-29234) bucketed table created by Spark SQL DataFrame is in SequenceFile format

     [ https://issues.apache.org/jira/browse/SPARK-29234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Suchintak Patnaik updated SPARK-29234:
--------------------------------------
    Description: 
When we create a bucketed table as follows, its input and output formats are displayed as SequenceFile. Physically, however, the files are created in HDFS in the format specified by the user, e.g. ORC, Parquet, etc.

df.write.format("orc").bucketBy(4,"order_status").saveAsTable("OrdersExample")

In Hive, running DESCRIBE FORMATTED OrdersExample; shows:

describe formatted ordersExample;
OK
# col_name              data_type               comment
col                     array<string>           from deserializer

# Storage Information
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:            org.apache.hadoop.mapred.SequenceFileInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat

Querying the same table in Hive gives an error.

select * from OrdersExample;
OK
Failed with exception java.io.IOException:java.io.IOException: hdfs://nn01.itversity.com:8020/apps/hive/warehouse/kuki.db/ordersexample/part-00000-55920574-eeb5-48b7-856d-e5c27e85ba12_00000.c000.snappy.orc not a SequenceFile
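
The mismatch described above (the metastore declares a SequenceFile InputFormat while the data files carry an .orc suffix) can be checked mechanically. The following is only a minimal sketch in plain Python, not part of Spark or Hive: the helper names and the suffix-to-class mapping are hypothetical and introduced here for illustration. It assumes the unquoted DESCRIBE FORMATTED layout shown in this report.

```python
import re

# Hypothetical mapping (an assumption for this sketch): data-file suffix ->
# substring expected in the Hive InputFormat class name for that format.
EXPECTED_INPUT_FORMAT = {
    ".orc": "OrcInputFormat",
    ".parquet": "ParquetInputFormat",
}


def declared_input_format(describe_output):
    """Return the class named on the 'InputFormat:' line, or None if absent."""
    match = re.search(r"^InputFormat:\s*(\S+)", describe_output, flags=re.MULTILINE)
    return match.group(1) if match else None


def format_mismatch(describe_output, data_file):
    """True when the file suffix implies a format the declared InputFormat cannot read."""
    declared = declared_input_format(describe_output)
    if declared is None:
        return False  # nothing declared, nothing to compare
    for suffix, expected in EXPECTED_INPUT_FORMAT.items():
        if data_file.endswith(suffix):
            return expected not in declared
    return False  # unknown suffix: this sketch makes no claim


# Values copied from the report above.
describe_output = """\
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:            org.apache.hadoop.mapred.SequenceFileInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
"""
data_file = "part-00000-55920574-eeb5-48b7-856d-e5c27e85ba12_00000.c000.snappy.orc"

print(format_mismatch(describe_output, data_file))
```

For the values in this report the check flags a mismatch, which is consistent with the "not a SequenceFile" error Hive raises when it opens the ORC file with the SequenceFile reader.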

Reading the same table in Spark also gives an error.

df = spark.


  was:
When we create a bucketed table as follows, its input and output formats are displayed as SequenceFile. Physically, however, the files are created in HDFS in the format specified by the user, e.g. ORC, Parquet, etc.

df.write.format("orc").bucketBy(4,"order_status").saveAsTable("OrdersExample")

In Hive, running DESCRIBE FORMATTED OrdersExample; shows:

describe formatted ordersExample;
OK
# col_name              data_type               comment
col                     array<string>           from deserializer

# Storage Information
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:            org.apache.hadoop.mapred.SequenceFileInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat

Querying the same table in Hive gives an error.

select * from OrdersExample;
OK
Failed with exception java.io.IOException:java.io.IOException: hdfs://nn01.itversity.com:8020/apps/hive/warehouse/kuki.db/ordersexample/part-00000-55920574-eeb5-48b7-856d-e5c27e85ba12_00000.c000.snappy.orc not a SequenceFile



> bucketed table created by Spark SQL DataFrame is in SequenceFile format
> -----------------------------------------------------------------------
>
>                 Key: SPARK-29234
>                 URL: https://issues.apache.org/jira/browse/SPARK-29234
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Suchintak Patnaik
>            Priority: Major
>
> When we create a bucketed table as follows, its input and output formats are displayed as SequenceFile. Physically, however, the files are created in HDFS in the format specified by the user, e.g. ORC, Parquet, etc.
> df.write.format("orc").bucketBy(4,"order_status").saveAsTable("OrdersExample")
> In Hive, running DESCRIBE FORMATTED OrdersExample; shows:
> describe formatted ordersExample;
> OK
> # col_name              data_type               comment
> col                     array<string>           from deserializer
> # Storage Information
> SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> InputFormat:            org.apache.hadoop.mapred.SequenceFileInputFormat
> OutputFormat:           org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
> Querying the same table in Hive gives an error.
> select * from OrdersExample;
> OK
> Failed with exception java.io.IOException:java.io.IOException: hdfs://nn01.itversity.com:8020/apps/hive/warehouse/kuki.db/ordersexample/part-00000-55920574-eeb5-48b7-856d-e5c27e85ba12_00000.c000.snappy.orc not a SequenceFile
> Reading the same table in Spark also gives an error.
> df = spark.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
