Posted to dev@spark.apache.org by Mick Davies <mi...@gmail.com> on 2019/04/16 12:09:16 UTC

Support for arrays in the Parquet vectorized reader

Hi,

I'm working with a medical data model that uses arrays of simple types to
represent things like the drug exposures and conditions that are associated
with a patient.
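A minimal sketch of what such a schema might look like (the table and
column names here are illustrative, not taken from the actual model):

```sql
-- Hypothetical patient table: arrays of simple types keep all of a
-- patient's drug exposures and conditions in a single row, so one
-- Parquet row group holds everything needed for per-patient work.
CREATE TABLE patients (
  patient_id     BIGINT,
  drug_exposures ARRAY<INT>,   -- e.g. drug concept ids
  conditions     ARRAY<INT>    -- e.g. condition concept ids
)
USING parquet;
```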

Using this model, each patient's data is co-located and is consequently
processed by Spark more efficiently. The data is stored in Parquet format.

To improve processing time, we have experimented with adding support
for arrays of simple types to the Parquet vectorized reader.

This change gives us significant performance improvements: more than 4x
faster for some operations.
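For context, the kind of scan that benefits can be sketched like this.
This is only an illustration against a hypothetical dataset path and
column name, and it requires a running Spark session; without a patch,
Spark falls back to the non-vectorized Parquet path whenever the schema
contains ARRAY columns, which is the gap the experiment addresses:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("array-scan").getOrCreate()

// Baseline: force the row-at-a-time (non-vectorized) Parquet reader.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
spark.time {
  spark.read.parquet("/data/patients")           // hypothetical path
    .selectExpr("explode(drug_exposures) AS d")  // hypothetical column
    .count()
}

// With the flag on, Spark only uses the vectorized reader for schemas of
// simple types today; extending it to cover simple arrays is what yields
// the speedup described above.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "true")
spark.time {
  spark.read.parquet("/data/patients")
    .selectExpr("explode(drug_exposures) AS d")
    .count()
}
```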

I was wondering whether any enhancements like this have been considered or
whether this work is something that could be useful to the wider community.


Regards

Mick Davies





--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org