Posted to dev@parquet.apache.org by iv...@gmail.com on 2018/07/21 04:24:57 UTC

Create parquet file chunks using Parquet CPP library

Hi,

I want to convert a huge dataset (e.g. 1 TB) from a database to a Parquet file. Because of file-system size and memory limitations, it is not possible to build one single Parquet file and store it on the file system. Instead, I plan to read data from the database in small chunks (e.g. 100 or 1000 rows) at a time, create a row group for each chunk, and upload the binary (Parquet-encoded) data for that chunk (a single row group) to S3 as soon as it is ready, without waiting for the whole Parquet file to be finished.

I am using the parquet-cpp library for this project, and as far as I can see it supports only a very limited workflow (take the whole table and store it as one single Parquet file on the file system), which is not possible in my case.

Is it possible to use the parquet-cpp library in the following way? Instead of providing a file name to the library, I would provide a named pipe (FIFO); whenever the library writes content into the FIFO, a background process uploads that content directly to S3. That way we could create one big Parquet file without ever storing the whole file on the file system or in memory.
- To try this, I passed a FIFO path instead of an actual file name to the library, but I got the error
                    “Parquet write error: Arrow error: IOError: lseek failed”
  Is this because the parquet-cpp library does not support a FIFO as the output file? If so, is there another way I can create the Parquet file?
- I could create one Parquet file per chunk (100 or 1000 rows), but that would produce a huge number of Parquet files. Instead, I want to build one Parquet file out of hundreds or thousands of chunks (writing a partial Parquet file for each chunk and uploading it to S3 immediately), even though I cannot hold all of these chunks together in memory or on the file system.

I hope my question is clear :) Thanks in advance!

Re: Create parquet file chunks using Parquet CPP library

Posted by Deepak Majeti <ma...@gmail.com>.
To achieve your goal, you must implement your own OutputStream for S3.
You can see an example implementation, InMemoryOutputStream, in the files
below:

https://github.com/apache/parquet-cpp/blob/6ab16f3ae8e4a76ea28a704d88267bb342ba407b/src/parquet/util/memory.h
https://github.com/apache/parquet-cpp/blob/6ab16f3ae8e4a76ea28a704d88267bb342ba407b/src/parquet/util/memory.cc
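
For illustration, a minimal sketch of what such a subclass could look like. The
parquet::OutputStream interface (Write/Tell/Close) is taken from memory.h above;
note it is append-only and never seeks, which also sidesteps the lseek error you
hit with the FIFO. The S3Client type and its UploadPart/CompleteMultipartUpload
methods are hypothetical stand-ins for whatever S3 SDK you actually use:

#include <cstdint>
#include <string>
#include <vector>

#include "parquet/util/memory.h"  // parquet::OutputStream

// Hypothetical stand-in for a real S3 SDK client (e.g. the AWS C++ SDK);
// these methods are placeholders, not a real API.
struct S3Client {
  void UploadPart(const std::string& key, const std::vector<uint8_t>& part);
  void CompleteMultipartUpload(const std::string& key);
};

class S3OutputStream : public parquet::OutputStream {
 public:
  S3OutputStream(S3Client* client, const std::string& key)
      : client_(client), key_(key), position_(0) {}

  // Buffer bytes and flush a multipart part once the buffer is large
  // enough (S3 requires parts of at least 5 MiB, except the last one).
  void Write(const uint8_t* data, int64_t length) override {
    buffer_.insert(buffer_.end(), data, data + length);
    position_ += length;
    if (buffer_.size() >= kPartSize) {
      client_->UploadPart(key_, buffer_);
      buffer_.clear();
    }
  }

  // The stream is append-only, so the position is just a byte counter;
  // no lseek is ever needed.
  int64_t Tell() override { return position_; }

  // Flush the final (possibly short) part and finish the upload.
  // Idempotent, in case both you and the file writer call Close().
  void Close() override {
    if (closed_) return;
    if (!buffer_.empty()) {
      client_->UploadPart(key_, buffer_);
      buffer_.clear();
    }
    client_->CompleteMultipartUpload(key_);
    closed_ = true;
  }

 private:
  static constexpr size_t kPartSize = 5 * 1024 * 1024;
  S3Client* client_;
  std::string key_;
  int64_t position_;
  std::vector<uint8_t> buffer_;
  bool closed_ = false;
};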

With an S3OutputStream implementation in place, you can then create a
ParquetFileWriter using the Open() API in the file below:

https://github.com/apache/parquet-cpp/blob/6ab16f3ae8e4a76ea28a704d88267bb342ba407b/src/parquet/file_writer.cc
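
Roughly, the per-chunk loop could then look like the sketch below: one row group
per database chunk, with bytes streaming through the sink as each row group is
written and the footer written only at Close(). It assumes a single required
INT64 column for simplicity; FetchNextChunkFromDb() is a hypothetical
placeholder, and older releases may need the sized AppendRowGroup(num_rows)
overload, so check file_writer.h for the exact signatures in your version:

#include <cstdint>
#include <memory>
#include <vector>

#include "parquet/api/writer.h"

// Hypothetical: pulls the next 100-1000 rows from the database,
// returning an empty vector when the table is exhausted.
std::vector<int64_t> FetchNextChunkFromDb();

void WriteTableToS3(const std::shared_ptr<parquet::OutputStream>& sink) {
  // One required INT64 column named "id", for illustration.
  parquet::schema::NodeVector fields;
  fields.push_back(parquet::schema::PrimitiveNode::Make(
      "id", parquet::Repetition::REQUIRED, parquet::Type::INT64));
  auto schema = std::static_pointer_cast<parquet::schema::GroupNode>(
      parquet::schema::GroupNode::Make(
          "schema", parquet::Repetition::REQUIRED, fields));

  std::unique_ptr<parquet::ParquetFileWriter> writer =
      parquet::ParquetFileWriter::Open(sink, schema);

  for (;;) {
    std::vector<int64_t> chunk = FetchNextChunkFromDb();
    if (chunk.empty()) break;

    // Each chunk becomes its own row group and is pushed through the
    // sink as soon as the row group is complete.
    parquet::RowGroupWriter* rg_writer = writer->AppendRowGroup();
    auto* col_writer =
        static_cast<parquet::Int64Writer*>(rg_writer->NextColumn());
    col_writer->WriteBatch(static_cast<int64_t>(chunk.size()),
                           nullptr, nullptr, chunk.data());
  }

  writer->Close();  // writes the Parquet footer last, through the same sink
  sink->Close();    // harmless if the writer already closed the sink
}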



-- 
regards,
Deepak Majeti