Posted to user@spark.apache.org by Christopher Piggott <cp...@gmail.com> on 2018/04/20 01:23:53 UTC

Stream writing parquet files

I am trying to write some parquet files and running out of memory.  I'm
giving my workers 16GB each, and the data is 102 columns * 65,536 rows - not
really all that much.  Each cell is a short string.

I am trying to create the file by dynamically building a StructType of
StructField objects.  I then tried various ways of building an Array or List
containing a Row object for each of the 65,536 rows.  The last attempt was to
pre-allocate an ArrayBuffer of the correct length.
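
Roughly, the code looks like this (a simplified sketch of what I described
above; the column names, cell values, and output path are just placeholders):

    import org.apache.spark.sql.{Row, SparkSession}
    import org.apache.spark.sql.types.{StringType, StructField, StructType}
    import scala.collection.mutable.ArrayBuffer

    val spark = SparkSession.builder().getOrCreate()

    // 102 string columns, built dynamically
    val schema = StructType(
      (0 until 102).map(i => StructField(s"col_$i", StringType, nullable = true)))

    // one Row per record; all 65,536 of them are built up in memory first
    val rows = new ArrayBuffer[Row](65536)
    for (r <- 0 until 65536) {
      rows += Row.fromSeq((0 until 102).map(c => s"value_${r}_${c}"))
    }

    val df = spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)
    df.write.parquet("hdfs:///path/to/output")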

In all cases, I run out of memory.

It occurs to me that what I really need is a way to generate the parquet
data and stream it directly to a file on HDFS.  I have 70,000+ of these input
files, so for starters I'm OK with creating 70,000 parquet files as long as
there's some way I can merge them later.
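
Merging afterwards seems like it should just be a read-and-rewrite, something
along these lines (assuming all the per-input files share the same schema;
the paths and partition count here are made up):

    // later: read all the small per-input parquet files and
    // rewrite them as fewer, larger files
    val merged = spark.read.parquet("hdfs:///out/per-input/*")
    merged.repartition(200).write.parquet("hdfs:///out/merged")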

Is there an approach for generating parquet files from Spark (ultimately to
HDFS) that lets me put each row out one at a time, in a streaming fashion?
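
The kind of thing I'm imagining is roughly what parquet-avro's
AvroParquetWriter does, taking one record at a time.  A rough sketch of what
I mean (I haven't tried this; the schema, path, and values are placeholders):

    import org.apache.avro.{Schema, SchemaBuilder}
    import org.apache.avro.generic.{GenericRecord, GenericRecordBuilder}
    import org.apache.hadoop.fs.Path
    import org.apache.parquet.avro.AvroParquetWriter
    import org.apache.parquet.hadoop.metadata.CompressionCodecName

    // 102 optional string fields, built dynamically
    val avroSchema: Schema = (0 until 102)
      .foldLeft(SchemaBuilder.record("row").fields()) {
        (b, i) => b.optionalString(s"col_$i")
      }
      .endRecord()

    val writer = AvroParquetWriter
      .builder[GenericRecord](new Path("hdfs:///out/one-input.parquet"))
      .withSchema(avroSchema)
      .withCompressionCodec(CompressionCodecName.SNAPPY)
      .build()

    try {
      for (r <- 0 until 65536) {
        val rec = new GenericRecordBuilder(avroSchema)
        (0 until 102).foreach(c => rec.set(s"col_$c", s"value_${r}_${c}"))
        writer.write(rec.build())  // one row out at a time, nothing accumulated on our side
      }
    } finally {
      writer.close()
    }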

BTW I'm using Spark 2.2.1 and whatever Parquet library was bundled with it.

--Chris

Re: Stream writing parquet files

Posted by Christopher Piggott <cp...@gmail.com>.
As a follow-up question, what happened to
org.apache.spark.sql.parquet.RowWriteSupport?  It seems like it would help me.
