Posted to user@arrow.apache.org by Sam Wright <Sa...@utah.edu> on 2020/07/22 00:48:14 UTC

rust-parquet write performance [help request]

(I should preface this by saying that I am extremely new to parquet.)

I have an application (written in rust) that logs high-frequency data to an
sqlite database. Our analysts would prefer to move the data to parquet.

I have written a simple proof of concept based on the 0.16.0 release of
parquet, and am getting quite poor write performance. I would like to
verify that I am approaching the problem correctly and using the tooling
properly.

My data is entirely flat, with ~1200 columns. Something like:

message my_data {
    required INT32 data1;
    required INT32 data2;
    ...
    required INT32 data1200;
}
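
For concreteness, a schema of this shape could be generated rather than typed
out by hand. The following is only a sketch: the helper name flat_schema is
made up for illustration, and it writes each field with the type before the
name, which is the order the Parquet schema text format expects. The resulting
string is what one would hand to parquet::schema::parser::parse_message_type
(not called here, so the snippet needs only the standard library).

```rust
/// Hypothetical helper: build the text of a flat Parquet message type
/// with `n` required INT32 columns named data1..data<n>.
fn flat_schema(name: &str, n: usize) -> String {
    let mut s = format!("message {} {{\n", name);
    for i in 1..=n {
        // One field per line: repetition, physical type, then column name.
        s.push_str(&format!("    required INT32 data{};\n", i));
    }
    s.push_str("}\n");
    s
}
```

Calling flat_schema("my_data", 1200) yields the 1200-column message type
described above.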

The program flow simply mirrors the example shown here
<https://docs.rs/parquet/0.16.0/parquet/column/index.html>, and is as
follows -- I open a file with a SerializedFileWriter, from which I get a
RowGroupWriter. Using that I get a typed ColumnWriter for each column and
call write_batch with its new data. (My supposition is that this
effectively creates a transaction for each column on each update, opening
and closing the file to make many small writes.)

In a parallel effort, another developer wrote another proof of concept
using the cpp variant of parquet. This version is many, many times faster.
They describe their flow as follows -- I use a ParquetFileWriter to create
an AppendBufferedRowGroup. From that I get a writer for the specific type
of data that I want to write and I call the WriteBatch method on it. After
I have written N rows (default N = 1000) I flush the FileOutputStream that
the ParquetFileWriter is using and finally I close the ParquetFileWriter. I
do that for each batch of N.
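
The buffering idea in that flow can be sketched independently of either
parquet library. The following is a minimal, standard-library-only
illustration, not the cpp developer's code and not part of the rust parquet
API: the ColumnBuffer type and its methods are hypothetical names. It
collects incoming rows into one Vec<i32> per column and signals when N rows
are ready, so that each column could then be passed to a single write_batch
call per row group instead of one call per update.

```rust
/// Hypothetical in-memory buffer holding one Vec<i32> per column.
struct ColumnBuffer {
    columns: Vec<Vec<i32>>,
    rows_per_group: usize,
}

impl ColumnBuffer {
    fn new(num_columns: usize, rows_per_group: usize) -> Self {
        ColumnBuffer {
            columns: (0..num_columns)
                .map(|_| Vec::with_capacity(rows_per_group))
                .collect(),
            rows_per_group,
        }
    }

    /// Number of rows currently buffered.
    fn len(&self) -> usize {
        self.columns.first().map_or(0, |c| c.len())
    }

    /// Append one row (one value per column).
    /// Returns true once at least `rows_per_group` rows are buffered.
    fn push_row(&mut self, row: &[i32]) -> bool {
        assert_eq!(row.len(), self.columns.len(), "row width must match column count");
        for (col, &value) in self.columns.iter_mut().zip(row) {
            col.push(value);
        }
        self.len() >= self.rows_per_group
    }

    /// Hand back the buffered columns and leave fresh, empty ones in place.
    fn take(&mut self) -> Vec<Vec<i32>> {
        let n = self.columns.len();
        let cap = self.rows_per_group;
        std::mem::replace(
            &mut self.columns,
            (0..n).map(|_| Vec::with_capacity(cap)).collect(),
        )
    }
}
```

In use, the logger would call push_row for every sample and, whenever it
returns true, call take and write each returned column as one batch into a
new row group.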

So my questions come in multiple parts -

   1. Is my rust workflow "correct"? I recognize that the reference
      implementation I am using involves nested data structures, where my
      use-case has none (no repetition or definition values).
   2. Is there a way to get the workflow outlined in the cpp example, but
      using the rust API? I recognize that the rust parquet writer is a WIP.
   3. If the API does not support this buffered functionality (yet?) is there
      a timeline for when it will?


- Sam