
Posted to user@spark.apache.org by Mike Wheeler <ro...@gmail.com> on 2017/05/12 03:07:18 UTC

Best Practice for Enum in Spark SQL

Hi Spark Users,

I want to store an Enum type (such as Vehicle Type: Car, SUV, Wagon) in my
data. My storage format will be Parquet, and I need to access the data from
spark-shell, the Spark SQL CLI, and Hive. My questions:

1) Should I store my Enum type as a String or as a numeric encoding
(e.g. 1=Car, 2=SUV, 3=Wagon)?

2) If I choose String, is there any penalty in disk space or memory?

Thank you!

Mike

Re: Best Practice for Enum in Spark SQL

Posted by Anastasios Zouzias <zo...@gmail.com>.

Hi Mike,

FYI: If you are using Spark 2.x, you might have issues with encoders if you
use a case class with an Enumeration-type field; see
https://issues.apache.org/jira/browse/SPARK-17248
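
One way to sidestep the encoder issue is to keep the enum only in
application code and store a plain value (its name or a numeric code) in
the record itself. The following is a language-neutral sketch in plain
Python, not Spark code; the VehicleType class and the helper function
names are hypothetical and chosen only for illustration.

```python
from enum import Enum


class VehicleType(Enum):
    # Hypothetical enum mirroring the example in the question.
    CAR = 1
    SUV = 2
    WAGON = 3


def to_stored_name(vehicle: VehicleType) -> str:
    """Value to write when the column is stored as a string."""
    return vehicle.name


def from_stored_name(name: str) -> VehicleType:
    """Rebuild the enum from a stored string; raises KeyError if unknown."""
    return VehicleType[name]


def to_stored_code(vehicle: VehicleType) -> int:
    """Value to write when the column is stored as an integer code."""
    return vehicle.value


def from_stored_code(code: int) -> VehicleType:
    """Rebuild the enum from a stored code; raises ValueError if unknown."""
    return VehicleType(code)
```

With the string form, the stored data stays readable from any tool that
queries the files. With the numeric form, every reader needs to know the
code-to-name mapping.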

For (1) and (2), I would guess Int would be better (space-wise), but I am
not familiar with Parquet's internals.
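
On the space question, it may help to know that Parquet can
dictionary-encode columns with few distinct values, so a repeated string
such as "Car" need not be written out in full for every row. The toy
sketch below shows only the idea; it is not Parquet's actual on-disk
layout, and the function name dictionary_encode is hypothetical. Actual
sizes depend on the writer settings and the data, so measuring on a
sample file is the safer way to decide.

```python
def dictionary_encode(values):
    """Return (dictionary, indices) for a column of repeated values."""
    dictionary = []   # distinct values, in order of first appearance
    positions = {}    # value -> its index in the dictionary
    indices = []      # one small integer per row
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


column = ["Car", "SUV", "Car", "Wagon", "SUV"]
dictionary, indices = dictionary_encode(column)
# Each distinct string is kept once in the dictionary; the rows
# themselves hold only integer positions into it.
```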

Best,
Anastasios

-- 
-- Anastasios Zouzias
<az...@zurich.ibm.com>