Posted to dev@parquet.apache.org by Jiayuan Chen <ha...@gmail.com> on 2018/12/10 22:29:47 UTC

parquet-arrow estimate file size

Hello,

I am a Parquet developer in the Bay Area, and I am writing to ask for help
with writing Parquet files from Arrow.

My goal is to control the size (in bytes) of the output Parquet file when
writing from an existing Arrow table. I saw a 2017 reply on this
StackOverflow post (
https://stackoverflow.com/questions/45572962/how-can-i-write-streaming-row-oriented-data-using-parquet-cpp-without-buffering)
and am wondering whether the following is currently possible: feed data
into the Arrow table until the buffered data is large enough to be
converted to a Parquet file of a target size (e.g. 256 MB, instead of a
fixed number of rows), and then use WriteTable() to create that Parquet
file.

I saw that parquet-cpp recently introduced an API in the low-level writer
to control the column writer's size in bytes, but it seems this is not yet
available in the arrow-parquet API. Is this on the roadmap?

Thanks,
Jiayuan
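
A minimal sketch of the buffering pattern described above, in Python with
pyarrow (the batch source, chunk sizes, and file names are illustrative
assumptions; nbytes measures the Arrow in-memory size, which only
approximates the encoded Parquet size):

import pyarrow as pa
import pyarrow.parquet as pq

TARGET_BYTES = 256 * 1024 * 1024   # rough target for the buffered Arrow data

def incoming_batches():
    # Stand-in for the application's streaming data source.
    for start in range(0, 1_000_000, 100_000):
        yield pa.RecordBatch.from_pydict(
            {'id': list(range(start, start + 100_000)),
             'value': [float(i) for i in range(100_000)]})

def flush(batches, file_no):
    # Convert the buffered batches to a Table and write one Parquet file.
    table = pa.Table.from_batches(batches)
    pq.write_table(table, 'chunk-%04d.parquet' % file_no)

batches, buffered, file_no = [], 0, 0
for batch in incoming_batches():
    batches.append(batch)
    buffered += batch.nbytes           # Arrow memory, not encoded Parquet bytes
    if buffered >= TARGET_BYTES:
        flush(batches, file_no)
        file_no += 1
        batches, buffered = [], 0
if batches:
    flush(batches, file_no)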

Re: parquet-arrow estimate file size

Posted by Jiayuan Chen <ha...@gmail.com>.
So it seems there is no way to implement such a mechanism using the
low-level API? I tried to dump the arrow::Buffer after each row group is
completed, but it is not a clean cut: the pages starting from the second
row group become unreadable (although the schema is correct).

If no such solution exists, I will go back to the high-level API that uses
an in-memory Arrow table.
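
This is expected given the Parquet layout: a file starts with the "PAR1"
magic and ends with the footer metadata (which holds the absolute offsets
of every row group and column chunk) followed by the magic again, and the
footer is only written when the file is closed. A raw byte range covering
only the second row group's pages has neither, so it cannot be decoded as
a file on its own. A sketch of a per-chunk alternative in Python with
pyarrow (the same layout constraints apply to the C++ writer; the helper
name is illustrative), serializing each chunk as its own complete
single-row-group file in memory:

import pyarrow as pa
import pyarrow.parquet as pq

def chunk_to_parquet_buffer(table):
    # Serialize one chunk as a complete, self-contained Parquet file
    # (magic header, one row group, footer) held in an Arrow buffer.
    sink = pa.BufferOutputStream()
    pq.write_table(table, sink)    # footer is written when the file is finalized
    return sink.getvalue()         # pa.Buffer containing a readable Parquet file

# Each buffer can be shipped to an external stream and read back on its own:
buf = chunk_to_parquet_buffer(pa.table({'id': [1, 2, 3]}))
print(pq.read_table(pa.BufferReader(buf)).num_rows)   # -> 3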





RE: parquet-arrow estimate file size

Posted by "Lee, David" <Da...@blackrock.com>.
In my experience and experiments it is really hard to approximate target sizes. A single parquet file with a single row group could be 20% larger than a parquet file with 20 row groups, because if you have a lot of rows with a lot of data variety you can lose dictionary encoding options. I predetermine my row group sizes by creating them as files and then writing them to a single parquet file.

A better approach would probably be to write row groups to a single file and, once the size exceeds your target, remove the last row group written and start a new file with it, but I don't think there is a method to remove a row group right now.

Another option would be to write the row group out as a file object in memory to predetermine its size before adding it as a row group in a parquet file.
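
A small pyarrow sketch of that last option, measuring a chunk's encoded
size by writing it to an in-memory buffer first (the helper name and the
128 MB budget are illustrative; the measured figure includes the per-file
header and footer, so it slightly overstates what the same data adds as a
row group):

import pyarrow as pa
import pyarrow.parquet as pq

def encoded_size(table, compression='snappy'):
    # Write the table to an in-memory Parquet file and report its size,
    # as a proxy for the size it would add as a row group.
    sink = pa.BufferOutputStream()
    pq.write_table(table, sink, compression=compression)
    return sink.getvalue().size

budget = 128 * 1024 * 1024
chunk = pa.table({'id': list(range(100_000))})
print(encoded_size(chunk) <= budget)   # decide before appending it for real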



Re: parquet-arrow estimate file size

Posted by Wes McKinney <we...@gmail.com>.
hi Hatem -- the arrow::FileWriter class doesn't provide any way for
you to control or examine the size of files as they are being written.
Ideally we would develop an interface to write a sequence of
arrow::RecordBatch objects that would automatically move on to a new
file once a certain approximate target size has been reached in an
existing file. There are a number of moving parts that would need to be
created to make this possible.

- Wes
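
A minimal sketch of the kind of rolling writer described above, in Python
with pyarrow (the class name, file-naming scheme, and default threshold are
assumptions; the size check relies on the sink's tell(), which reflects the
row groups written so far but not the final footer):

import pyarrow as pa
import pyarrow.parquet as pq

class RollingParquetWriter:
    # Writes tables to numbered files, starting a new file once the current
    # one passes an approximate byte threshold.
    def __init__(self, prefix, schema, target_bytes=256 * 1024 * 1024):
        self.prefix = prefix
        self.schema = schema
        self.target_bytes = target_bytes
        self.index = 0
        self.sink = None
        self.writer = None

    def _open_next(self):
        self.index += 1
        self.sink = pa.OSFile('%s-%04d.parquet' % (self.prefix, self.index), 'wb')
        self.writer = pq.ParquetWriter(self.sink, self.schema)

    def write_table(self, table):
        if self.writer is None:
            self._open_next()
        self.writer.write_table(table)          # appends one row group
        if self.sink.tell() >= self.target_bytes:
            self.close()                        # next write starts a new file

    def close(self):
        if self.writer is not None:
            self.writer.close()                 # writes the footer
            self.sink.close()
            self.writer = None
            self.sink = None

Calling close() once more after the last write finalizes whatever partial
file is still open.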

Re: parquet-arrow estimate file size

Posted by Hatem Helal <Ha...@mathworks.co.uk>.
I think if I've understood the problem correctly, you could use the parquet::arrow::FileWriter

https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/writer.h#L128

The basic pattern is to use an object to manage the FileWriter lifetime, call the WriteTable method for each row group, and close it when you are done. My understanding is that each call to WriteTable will append a new row group, which should allow you to incrementally write a dataset that does not fit in memory. I realize that I haven't tested this myself, so it would be good to double-check with someone more experienced with the parquet-cpp APIs.
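
In the Python bindings, which are built on the same Arrow-to-Parquet C++
writer, this behavior can be checked directly: each write_table() call
appends one row group, and the file is finalized on close. A small sketch
(the path and schema are illustrative):

import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([('id', pa.int64())])
writer = pq.ParquetWriter('/tmp/appended.parquet', schema)
for start in (0, 10, 20):
    writer.write_table(pa.table({'id': list(range(start, start + 10))},
                                schema=schema))
writer.close()                                   # writes the footer

md = pq.ParquetFile('/tmp/appended.parquet').metadata
print(md.num_row_groups)                         # 3: one per write_table() call
print(md.num_rows)                               # 30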



Re: parquet-arrow estimate file size

Posted by Jiayuan Chen <ha...@gmail.com>.
Thanks for the suggestion, will do.

Since such a high-level API is not yet implemented in the parquet-cpp
project, I have to fall back to the newly introduced low-level API that
tracks the Parquet file size as data is added to the column writers. I have
another question on that part:

Is there any sample code or advice I can follow to stream a Parquet file on
a per-row-group basis? In other words, to restrict memory usage while still
creating a large enough Parquet file, I would like to build a relatively
small row group in memory using InMemoryOutputStream(), dump the buffer
contents to my external stream after completing each row group, and repeat
until a big file with several row groups is finished. However, my attempts
to manipulate the underlying arrow::Buffer have failed: the pages starting
from the second row group are unreadable.

Thanks!
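
One way to get this effect without touching the buffer internals is to hand
the external output stream to the writer itself, so each small row group is
encoded and written through to the sink as it completes and only one row
group's worth of data needs to be held in memory at a time. A sketch in
Python with pyarrow (the sink path, schema, and chunk sizes are
illustrative; the low-level C++ column-writer API has no direct Python
counterpart here):

import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([('id', pa.int64()), ('payload', pa.string())])

# Any pyarrow NativeFile or Python file-like object can serve as the sink.
sink = pa.OSFile('/tmp/streamed.parquet', 'wb')
writer = pq.ParquetWriter(sink, schema)

for start in range(0, 500_000, 50_000):      # small row groups bound memory use
    chunk = pa.table({'id': list(range(start, start + 50_000)),
                      'payload': ['x'] * 50_000}, schema=schema)
    writer.write_table(chunk)                # one row group, written to the sink

writer.close()   # footer written last; earlier bytes alone are not a valid file
sink.close()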


Re: parquet-arrow estimate file size

Posted by Wes McKinney <we...@gmail.com>.
hi Jiayuan,

To your question

> Would this be in the roadmap?

I doubt there would be any objections to adding this feature to the
Arrow writer API -- please feel free to open a JIRA issue to describe
how the API might work in C++. Note there is no formal roadmap in this
project.

- Wes

Re: parquet-arrow estimate file size

Posted by Jiayuan Chen <ha...@gmail.com>.
Thanks for the Python solution. However, is there a way in C++ to create
such a Parquet file using only an in-memory buffer, with the parquet-cpp
library?


RE: parquet-arrow estimate file size

Posted by "Lee, David" <Da...@blackrock.com>.
Resending.. Somehow I lost some line feeds in the previous reply..

import os
import pyarrow.parquet as pq
import glob as glob
 
max_target_size = 134217728
target_size = max_target_size * .95
# Directory where parquet files are saved
working_directory = '/tmp/test'
files_dict = dict()
files = glob.glob(os.path.join(working_directory, "*.parquet"))
files.sort()
for file in files:
    files_dict[file] = os.path.getsize(file)
print("Merging parquet files")
temp_file = os.path.join(working_directory, "temp.parquet")
file_no = 0
for file in files:
    if file in files_dict:
        file_no = file_no + 1
        file_name = os.path.join(working_directory, str(file_no).zfill(4) + ".parquet")
        print("Saving to parquet file " + file_name)
        # Just rename file if the file size is in target range
        if files_dict[file] > target_size:
            del files_dict[file]
            os.rename(file, file_name)
            continue
        merge_list = list()
        file_size = 0
        # Find files to merge together which add up to less than 128 megs
        for k, v in files_dict.items():
            if file_size + v <= max_target_size:
                print("Adding file " + k + " to merge list")
                merge_list.append(k)
                file_size = file_size + v
        # Just rename file if there is only one file to merge
        if len(merge_list) == 1:
            del files_dict[merge_list[0]]
            os.rename(merge_list[0], file_name)
            continue
        # Merge smaller files into one large file. Read row groups from each file and add them to the new file.
        schema = pq.read_schema(file)
        print("Saving to new parquet file")
        writer = pq.ParquetWriter(temp_file, schema=schema, use_dictionary=True, compression='snappy')
        for merge in merge_list:
            parquet_file = pq.ParquetFile(merge)
            print("Writing " + merge + " to new parquet file")
            for i in range(parquet_file.num_row_groups):
                writer.write_table(parquet_file.read_row_group(i))
            del files_dict[merge]
            os.remove(merge)
        writer.close()
        os.rename(temp_file, file_name)
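
As a quick sanity check on a merged file (reusing file_name from the loop above), the Parquet footer reports how many row groups it contains and their uncompressed data size; comparing that against the on-disk size shows the effect of compression and dictionary encoding:

meta = pq.ParquetFile(file_name).metadata
print("row groups: " + str(meta.num_row_groups))
uncompressed = sum(meta.row_group(i).total_byte_size for i in range(meta.num_row_groups))
print("uncompressed bytes: " + str(uncompressed))
print("bytes on disk: " + str(os.path.getsize(file_name)))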


-----Original Message-----
From: Jiayuan Chen <ha...@gmail.com> 
Sent: Monday, December 10, 2018 2:30 PM
To: dev@parquet.apache.org
Subject: parquet-arrow estimate file size


RE: parquet-arrow estimate file size

Posted by "Lee, David" <Da...@blackrock.com>.
Here's some sample code:

import os
import pyarrow.parquet as pq
import glob
 
max_target_size = 134217728
target_size = max_target_size * .95
# Directory where parquet files are saved
working_directory = '/tmp/test'
files_dict = dict()
files = glob.glob(os.path.join(working_directory, "*.parquet"))
files.sort()
for file in files:
    files_dict[file] = os.path.getsize(file)
print("Merging parquet files")
temp_file = os.path.join(working_directory, "temp.parquet")
file_no = 0
for file in files:
    if file in files_dict:
        file_no = file_no + 1
        file_name = os.path.join(working_directory, str(file_no).zfill(4) + ".parquet")
        print("Saving to parquet file " + file_name)
        # Just rename file if the file size is in target range
        if files_dict[file] > target_size:
            del files_dict[file]
            os.rename(file, file_name)
            continue
        merge_list = list()
        file_size = 0
        # Find files to merge together which add up to less than 128 megs
        for k, v in files_dict.items():
            if file_size + v <= max_target_size:
                print("Adding file " + k + " to merge list")
                merge_list.append(k)
                file_size = file_size + v
        # Just rename file if there is only one file to merge
        if len(merge_list) == 1:
            del files_dict[merge_list[0]]
            os.rename(merge_list[0], file_name)
            continue
        # Merge smaller files into one large file. Read row groups from each file and add them to the new file.
        schema = pq.read_schema(file)
        print("Saving to new parquet file")
        writer = pq.ParquetWriter(temp_file, schema=schema, use_dictionary=True, compression='snappy')
        for merge in merge_list:
            parquet_file = pq.ParquetFile(merge)
            print("Writing " + merge + " to new parquet file")
            for i in range(parquet_file.num_row_groups):
                writer.write_table(parquet_file.read_row_group(i))
            del files_dict[merge]
            os.remove(merge)
        writer.close()
        os.rename(temp_file, file_name)

-----Original Message-----
From: Jiayuan Chen <ha...@gmail.com> 
Sent: Monday, December 10, 2018 2:30 PM
To: dev@parquet.apache.org
Subject: parquet-arrow estimate file size


RE: parquet-arrow estimate file size

Posted by "Lee, David" <Da...@blackrock.com>.
Here's my comment describing how I'm generating 128 meg parquet files. This approach takes into account file sizes after compression and dictionary encoding.

https://issues.apache.org/jira/browse/ARROW-3728?focusedCommentId=16703544&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16703544

It would be nice to have a merge() function for parquet files that does something similar, to create parquet files which match HDFS block sizes.
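
A rough sketch of what that could look like, reusing the row-group copy approach from the script above (merge_parquet_files is a hypothetical name, not an existing pyarrow API, and it assumes every input file shares the same schema):

import pyarrow.parquet as pq

def merge_parquet_files(paths, out_path, compression='snappy'):
    # Copy every row group from the input files into a single output file.
    # Row groups are decoded to Arrow tables and re-encoded on write, so
    # this is not a raw byte copy of the pages.
    schema = pq.read_schema(paths[0])
    writer = pq.ParquetWriter(out_path, schema=schema,
                              use_dictionary=True, compression=compression)
    for path in paths:
        parquet_file = pq.ParquetFile(path)
        for i in range(parquet_file.num_row_groups):
            writer.write_table(parquet_file.read_row_group(i))
    writer.close()

A size-aware version could sum os.path.getsize() over candidate inputs, as in the script above, and start a new output file once the running total approaches the HDFS block size.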


-----Original Message-----
From: Jiayuan Chen <ha...@gmail.com> 
Sent: Monday, December 10, 2018 2:30 PM
To: dev@parquet.apache.org
Subject: parquet-arrow estimate file size
