Posted to user@spark.apache.org by Mike Wheeler <ro...@gmail.com> on 2017/05/12 03:07:18 UTC
Best Practice for Enum in Spark SQL
Hi Spark Users,
I want to store an Enum type (such as Vehicle Type: Car, SUV, Wagon) in my
data. My storage format will be Parquet, and I need to access the data from
spark-shell, the Spark SQL CLI, and Hive. My questions:
1) Should I store my Enum type as a String or as a numeric encoding
(e.g. 1=Car, 2=SUV, 3=Wagon)?
2) If I choose String, is there any penalty in disk space or memory?
Thank you!
Mike
Re: Best Practice for Enum in Spark SQL
Posted by Anastasios Zouzias <zo...@gmail.com>.
Hi Mike,
FYI: If you are using Spark 2.x, you might have issues with encoders if you
use a case class with an Enumeration-typed field, see
https://issues.apache.org/jira/browse/SPARK-17248
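One way to sidestep encoder trouble of this kind is to keep the stored field a plain string or integer and convert to the enum only at the application boundary. A minimal, language-neutral sketch in Python follows; `VehicleType` and the helper names are hypothetical, and the integer codes follow the 1=Car, 2=SUV, 3=Wagon example from the question. In Scala the same idea would mean declaring the case class field as String or Int rather than as an Enumeration value.

```python
from enum import Enum


class VehicleType(Enum):
    # Hypothetical codes, taken from the example in the question.
    CAR = 1
    SUV = 2
    WAGON = 3


def to_stored_string(v: VehicleType) -> str:
    """Value to write when the column is stored as a string."""
    return v.name


def from_stored_string(s: str) -> VehicleType:
    """Parse a stored string; raises KeyError for an unknown name."""
    return VehicleType[s]


def to_stored_int(v: VehicleType) -> int:
    """Value to write when the column is stored as a numeric code."""
    return v.value


def from_stored_int(i: int) -> VehicleType:
    """Parse a stored code; raises ValueError for an unknown code."""
    return VehicleType(i)
```

With the string form, the data stays readable from the Spark SQL CLI and Hive without a lookup table; with the integer form, every reader needs to know the code mapping.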
For (1) and (2), I would guess Int is better space-wise, but I am not
familiar with Parquet's internals.
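On the space question, one relevant detail is that the Parquet format supports dictionary encoding, where a column's distinct values are stored once and each row holds only a small integer index. Whether a given writer applies it to a particular column depends on its configuration and the data, so the actual saving is worth measuring rather than assuming. The toy sketch below only illustrates the idea; the function names are hypothetical and this is not Parquet's actual implementation.

```python
def dictionary_encode(values):
    """Split a column into (distinct values, per-row indices)."""
    dictionary = []   # distinct values, in order of first appearance
    index_of = {}     # value -> position in `dictionary`
    indices = []      # one small integer per row
    for v in values:
        if v not in index_of:
            index_of[v] = len(dictionary)
            dictionary.append(v)
        indices.append(index_of[v])
    return dictionary, indices


def dictionary_decode(dictionary, indices):
    """Rebuild the original column from the dictionary and indices."""
    return [dictionary[i] for i in indices]


column = ["Car", "SUV", "Car", "Wagon", "SUV", "Car"]
dictionary, indices = dictionary_encode(column)
# dictionary is ["Car", "SUV", "Wagon"]; indices is [0, 1, 0, 2, 1, 0]
```

Under this scheme a low-cardinality string column costs roughly one copy of each distinct string plus the indices, which narrows the gap between storing names and storing hand-assigned integer codes.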
Best,
Anastasios
--
-- Anastasios Zouzias
<az...@zurich.ibm.com>