Posted to dev@parquet.apache.org by "Uwe L. Korn" <uw...@xhochy.com> on 2017/04/06 09:35:17 UTC

Marking categorical data in Parquet schemas

Hello,

we often have the case that we want to treat some columns as categorical
data [1] (also called factors [2] in R) in memory. This is a column that
can only take a limited set of values. These types can also have an
ordering. In Apache Arrow, we have defined the DictionaryType [3] for
this. It takes an index (also called the categories in some contexts)
and stores the actual data as a separate integer array. The most common
use case is that the categories are strings; thus, in engines that don't
explicitly support categorical data, these columns should be treated as
UTF8 data.
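
To make this concrete, here is a minimal sketch using pyarrow's
DictionaryArray (the values are just placeholders):

    import pyarrow as pa

    # The categories (the dictionary) and the per-value indices into it.
    categories = pa.array(["low", "medium", "high"])
    indices = pa.array([0, 2, 1, 0, 2], type=pa.int8())
    arr = pa.DictionaryArray.from_arrays(indices, categories)
    # arr behaves like a string column but stores each distinct value once.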

While this is similar to dictionary encoding, and dictionary encoding
is probably the most efficient form in which to store categorical data,
the two are semantically not the same (e.g. dictionaries are defined
per RowGroup whereas categories are defined on a per-column basis).
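
The per-RowGroup nature is visible in the file metadata. A small sketch
that inspects it with pyarrow (the file name is a placeholder):

    import pyarrow.parquet as pq

    # Each row group carries its own column chunks (and thus its own
    # dictionary pages), so encodings are reported per chunk.
    md = pq.ParquetFile("example.parquet").metadata
    for rg in range(md.num_row_groups):
        print(rg, md.row_group(rg).column(0).encodings)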

To implement support for categorical data, several options come to my
mind:

1. Add an additional flag / metadata to the schema in
https://github.com/apache/parquet-format/blob/4bddbadf79e20a32152076fbedae0c3ce77fb531/src/main/thrift/parquet.thrift#L220
2. Add a new ConvertedType UTF8_CATEGORICAL
3. Add a new physical type for categoricals (this would mirror the
implementation in Arrow)

Number 1 is the only option that would work well with old readers;
options 2 and 3 would produce files that cannot be read correctly by
older implementations.

[1] http://pandas.pydata.org/pandas-docs/stable/categorical.html
[2] https://stat.ethz.ch/R-manual/R-devel/library/base/html/factor.html
[3] https://github.com/apache/arrow/blob/master/cpp/src/arrow/type.h#L576

-- 
  Uwe L. Korn
  uwelk@xhochy.com

Re: Marking categorical data in Parquet schemas

Posted by Wes McKinney <we...@gmail.com>.
hi Uwe,

Thanks for bringing this up.

I have a somewhat different opinion, which is that I don't think
categorical metadata belongs _formally_ in the Parquet format. The
reason is that database systems generally address storage of
categorical data using fact and dimension tables -- if you store data
in Parquet and your set of categories needs to expand, it's generally
not feasible to modify old data files to account for the expanded
category set.

Parquet uses dictionary encoding as a compression technique (combined
with RLE encoding, see e.g. the low entropy examples in
http://wesmckinney.com/blog/python-parquet-multithreading/). Parquet
as long-term storage is distinct from Arrow's use case as an in-memory
data structure and transient IPC format, where handling in-memory
dictionary-encoded / categorical data makes more sense.
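
For reference, pyarrow exposes dictionary encoding as a per-column
writer option; a sketch (the column and file names are made up for
illustration):

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"color": ["red", "green", "red", "red"]})
    # Dictionary encoding here is purely a compression choice; it
    # carries no categorical semantics for readers.
    pq.write_table(table, "colors.parquet", use_dictionary=["color"])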

I do think it's reasonable to want to faithfully round trip
categorical data from R and Python to the Parquet format -- I would
instead like us to specify a KeyValue metadata convention that we can
all use to maximize interoperability between implementations.
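
As a rough sketch of what such a convention could look like (the key
name "categorical_columns" below is an illustration, not a proposal):

    import json
    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"color": ["red", "green", "red"]})
    # Record the categorical columns in the file-level key/value metadata.
    meta = dict(table.schema.metadata or {})
    meta[b"categorical_columns"] = json.dumps(["color"]).encode()
    pq.write_table(table.replace_schema_metadata(meta), "colors.parquet")

    # A reader aware of the convention can restore the categorical type.
    read = pq.read_table("colors.parquet")
    cats = json.loads(read.schema.metadata[b"categorical_columns"])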

Because dictionaries vary in size, even when we store them in Parquet
format we'll have a couple cases to handle:

* Small dictionaries -- all of the Parquet data pages contain dictionary indices
* Large dictionaries -- the encoder fell back to PLAIN encoding
because the dictionary page exceeded a size threshold

In the first case, we can avoid extra hashing/encoding when reading
the file by using the dictionary page directly. In the latter case, if
we want to construct the original in-memory data faithfully, we'll
have to hash.
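
In pyarrow terms, the fallback case might look like this (a sketch):

    import pyarrow as pa

    # The values came back PLAIN-encoded, so we rebuild the dictionary
    # by hashing the values again.
    plain = pa.array(["a", "b", "a", "c"])
    encoded = plain.dictionary_encode()  # indices + dictionary once more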

- Wes
