You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@arrow.apache.org by Jason Sachs <jm...@gmail.com> on 2020/11/03 20:41:34 UTC
Best way to store ragged packet data in Parquet files
(reposted here; I had posted to dev@arrow by mistake)
Hi all--
I've been getting started with Parquet as a storage alternative to HDF5 and it has a lot of attractive quantities including compression flexibility efficiency.
But I'm stumped for storage efficiency in Parquet with one type of data that I have.
This is a large series of "ragged" packets arriving as a stream, where each packet consists of up to 255 bytes of binary data. The vast majority of the packets have lengths between 96 and 112 bytes. I need to store each of them with a 64-bit timestamp.
I can get a good storage efficiency with HDF5 with the following table schema using pytables:
class StoredPacket(pt.IsDescription):
timetick = pt.UInt64Col(pos=0)
length = pt.UInt16Col(pos=1)
data = pt.UInt8Col(pos=2,shape=(255,))
This stores packet data as an array of uint8 with length 255. I zero-pad the packet to length 255 and store the length as well in a separate column.
I have created a sample file in a Github gist: https://gist.github.com/jason-sachs/aa6dbdaced806bb76bc7a347dfc303dc (see test1.h5) along with a Python script convert_test1.py that converts it to a Pandas DataFrame and stores it via Parquet. But the Parquet files are almost twice as large as the .h5 file no matter what storage technique I use; brotli is best but slow, and zstd is almost as good as brotli but much faster.
Any suggestions on how I might improve storage efficiency in Parquet? I have a lot of flexibility with how I can store the data; my only requirement is that I can retrieve the data packets quickly from the storage file. I offer this sample file as a test case.
(py3) C:\tmp\git\dv\test-h5-gist>python convert_test1.py
Table overview:
timetick length data
0 16 99 b'\x00\x00\x00\x98:B\x1a\xbev\x90\xb2\x00\x00\...
1 32 99 b'\x01\x08\x00\xbf:\x8b\x1a{r=\xb2\x88\x00\t\x...
2 48 99 b'\x02\x10\x00\xe7:\x9c\x1c\x1at:\xb3\x10\x01\...
3 64 99 b"\x03\x18\x00\x0f;\x16\x1bOt|\xb2\x98\x01\x19...
4 80 99 b'\x04 \x007;c\x1b\xddt~\xb2 \x02!\x00<;x\x1a\...
.. ... ... ...
16413 262080 99 b'{\xd8\xff\x1d+\xe6\xc5H)r\xc1X\xfd\xd9\xff +...
16414 262096 99 b'|\xe0\xff6+g\xc5A,\x0c\xc3\xe0\xfd\xe1\xff9+...
16415 262112 99 b'}\xe8\xffN+\xd3\xc4")D\xc2h\xfe\xe9\xffQ+M\x...
16416 262128 99 b"~\xf0\xffg+=\xc5E';\xc2\xf0\xfe\xf1\xffj+\xf...
16417 262144 99 b"\x7f\xf8\xff\x81+\x13\xc4\xdd'\x15\xc2x\xff\...
[16418 rows x 3 columns]
Packets with tags >= 128:
timetick length data
179 2864 36 b"\xca'Twas brillig, and the slithy toves\x00\...
307 4896 35 b'\xca Did gyre and gimble in the wabe:\x00\x...
340 5408 30 b'\xcaAll mimsy were the borogoves,\x00\x00\x0...
362 5744 31 b'\xca And the mome raths outgrabe.\x00\x00\x...
651 10352 1 b'\xca\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00...
1403 22368 32 b'\xca"Beware the Jabberwock, my son!\x00\x00\...
1741 27760 44 b'\xca The jaws that bite, the claws that cat...
2115 33728 33 b'\xcaBeware the Jubjub bird, and shun\x00\x00...
2162 34464 30 b'\xca The frumious Bandersnatch!"\x00\x00\x0...
2278 36304 1 b'\xca\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00...
2405 38320 34 b'\xcaHe took his vorpal sword in hand:\x00\x0...
2675 42624 41 b'\xca Long time the manxome foe he sought --...
2896 46144 33 b'\xcaSo rested he by the Tumtum tree,\x00\x00...
3611 57568 31 b'\xca And stood awhile in thought.\x00\x00\x...
4089 65200 1 b'\xca\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00...
5231 83456 36 b'\xcaAnd, as in uffish thought he stood,\x00\...
5236 83520 38 b'\xca The Jabberwock, with eyes of flame,\x0...
5427 86560 40 b'\xcaCame whiffling through the tulgey wood,\...
6904 110176 26 b'\xca And burbled as it came!\x00\x00\x00\x0...
7003 111744 1 b'\xca\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00...
7286 116256 44 b'\xcaOne, two! One, two! And through and thro...
8226 131280 39 b'\xca The vorpal blade went snicker-snack!\x...
8370 133568 35 b'\xcaHe left it dead, and with its head\x00\x...
8849 141216 27 b'\xca He went galumphing back.\x00\x00\x00\x...
10326 164832 1 b'\xca\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00...
11867 189472 37 b'\xca"And, has thou slain the Jabberwock?\x00...
12392 197856 35 b'\xca Come to my arms, my beamish boy!\x00\x...
12936 206544 34 b"\xcaO frabjous day! Callooh! Callay!'\x00\x0...
13794 220256 26 b'\xca He chortled in his joy.\x00\x00\x00\x0...
13905 222016 1 b'\xca\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00...
14690 234560 36 b"\xca'Twas brillig, and the slithy toves\x00\...
15317 244576 35 b'\xca Did gyre and gimble in the wabe;\x00\x...
15840 252928 30 b'\xcaAll mimsy were the borogoves,\x00\x00\x0...
16339 260896 31 b'\xca And the mome raths outgrabe.\x00\x00\x...
(py3) C:\tmp\git\dv\test-h5-gist>ls -l test1.*
-rw-rw-rw- 1 user group 908773 Nov 2 13:07 test1.h5
-rw-rw-rw- 1 user group 1611025 Nov 2 13:35 test1.pq
(py3) C:\tmp\git\dv\test-h5-gist>h5ls -v -r test1.h5
Opened "test1.h5" with sec2 driver.
/ Group
Attribute: CLASS scalar
Type: 5-byte null-terminated UTF-8 string
Data: "GROUP"
Attribute: PYTABLES_FORMAT_VERSION scalar
Type: 3-byte null-terminated UTF-8 string
Data: "2.1"
Attribute: TITLE null
Type: 1-byte null-terminated UTF-8 string
Attribute: VERSION scalar
Type: 3-byte null-terminated UTF-8 string
Data: "1.0"
Location: 1:96
Links: 1
/data Group
Attribute: CLASS scalar
Type: 5-byte null-terminated UTF-8 string
Data: "GROUP"
Attribute: TITLE null
Type: 1-byte null-terminated UTF-8 string
Attribute: VERSION scalar
Type: 3-byte null-terminated UTF-8 string
Data: "1.0"
Location: 1:1024
Links: 1
/data/packets Dataset {16418/Inf}
Attribute: CLASS scalar
Type: 5-byte null-terminated UTF-8 string
Data: "TABLE"
Attribute: FIELD_0_FILL scalar
Type: native unsigned long long
Data: 0
Attribute: FIELD_0_NAME scalar
Type: 8-byte null-terminated UTF-8 string
Data: "timetick"
Attribute: FIELD_1_FILL scalar
Type: native unsigned short
Data: 0
Attribute: FIELD_1_NAME scalar
Type: 6-byte null-terminated UTF-8 string
Data: "length"
Attribute: FIELD_2_FILL scalar
Type: native unsigned char
Data: 0
Attribute: FIELD_2_NAME scalar
Type: 4-byte null-terminated UTF-8 string
Data: "data"
Attribute: NROWS scalar
Type: native long long
Data: 16418
Attribute: TITLE null
Type: 1-byte null-terminated UTF-8 string
Attribute: VERSION scalar
Type: 3-byte null-terminated UTF-8 string
Data: "2.7"
Location: 1:2216
Links: 1
Chunks: {247} 65455 bytes
Storage: 4350770 logical bytes, 899061 allocated bytes, 483.92% utilization
Filter-0: shuffle-2 OPT {265}
Filter-1: deflate-1 OPT {5}
Type: struct {
"timetick" +0 native unsigned long long
"length" +8 native unsigned short
"data" +10 [255] native unsigned char
} 265 bytes
Re: Best way to store ragged packet data in Parquet files
Posted by Micah Kornfield <em...@gmail.com>.
>
> For the test cases I have, >99% of the packets are the same length, so
> there's little-to-no benefit of removing the padding; the length field and
> zero padding barely adds anything once you factor compression into the mix.
Are you writing the data out as fixed size bytes arrays or as variable
length binary data?
On Tue, Nov 3, 2020 at 1:26 PM Jason Sachs <jm...@gmail.com> wrote:
>
>
> On 2020/11/03 20:49:46, Micah Kornfield <em...@gmail.com> wrote:
> > Hi Jason,
> > At least as a first pass I would try to avoid the padding and storing the
> > length separately in Parquet. Using one column for timestamp and one
> > column of bytes for the data is what I would try first. If there is any
> > structure to the packets splitting them into the structure could also
> help.
> >
> > -Micah
>
> For the test cases I have, >99% of the packets are the same length, so
> there's little-to-no benefit of removing the padding; the length field and
> zero padding barely adds anything once you factor compression into the mix.
>
> I've tried use_dictionaries=False and that does help some.
>
> But I'll post an updated example to back these statements up and see how
> much better I can get.
>
> I'm just surprised that hdf5 does a better job in this case; maybe I don't
> understand the constraints the file format imposes on data compression.
>
Re: Best way to store ragged packet data in Parquet files
Posted by Jason Sachs <jm...@gmail.com>.
On 2020/11/03 20:49:46, Micah Kornfield <em...@gmail.com> wrote:
> Hi Jason,
> At least as a first pass I would try to avoid the padding and storing the
> length separately in Parquet. Using one column for timestamp and one
> column of bytes for the data is what I would try first. If there is any
> structure to the packets splitting them into the structure could also help.
>
> -Micah
For the test cases I have, >99% of the packets are the same length, so there's little-to-no benefit of removing the padding; the length field and zero padding barely adds anything once you factor compression into the mix.
I've tried use_dictionaries=False and that does help some.
But I'll post an updated example to back these statements up and see how much better I can get.
I'm just surprised that hdf5 does a better job in this case; maybe I don't understand the constraints the file format imposes on data compression.
Re: Best way to store ragged packet data in Parquet files
Posted by Micah Kornfield <em...@gmail.com>.
Hi Jason,
At least as a first pass I would try to avoid the padding and storing the
length separately in Parquet. Using one column for timestamp and one
column of bytes for the data is what I would try first. If there is any
structure to the packets splitting them into the structure could also help.
-Micah
On Tue, Nov 3, 2020 at 12:41 PM Jason Sachs <jm...@gmail.com> wrote:
> (reposted here; I had posted to dev@arrow by mistake)
>
> Hi all--
>
> I've been getting started with Parquet as a storage alternative to HDF5
> and it has a lot of attractive quantities including compression flexibility
> efficiency.
>
> But I'm stumped for storage efficiency in Parquet with one type of data
> that I have.
>
> This is a large series of "ragged" packets arriving as a stream, where
> each packet consists of up to 255 bytes of binary data. The vast majority
> of the packets have lengths between 96 and 112 bytes. I need to store each
> of them with a 64-bit timestamp.
>
> I can get a good storage efficiency with HDF5 with the following table
> schema using pytables:
>
> class StoredPacket(pt.IsDescription):
> timetick = pt.UInt64Col(pos=0)
> length = pt.UInt16Col(pos=1)
> data = pt.UInt8Col(pos=2,shape=(255,))
>
> This stores packet data as an array of uint8 with length 255. I zero-pad
> the packet to length 255 and store the length as well in a separate column.
>
> I have created a sample file in a Github gist:
> https://gist.github.com/jason-sachs/aa6dbdaced806bb76bc7a347dfc303dc (see
> test1.h5) along with a Python script convert_test1.py that converts it to a
> Pandas DataFrame and stores it via Parquet. But the Parquet files are
> almost twice as large as the .h5 file no matter what storage technique I
> use; brotli is best but slow, and zstd is almost as good as brotli but much
> faster.
>
> Any suggestions on how I might improve storage efficiency in Parquet? I
> have a lot of flexibility with how I can store the data; my only
> requirement is that I can retrieve the data packets quickly from the
> storage file. I offer this sample file as a test case.
>
> (py3) C:\tmp\git\dv\test-h5-gist>python convert_test1.py
> Table overview:
> timetick length data
> 0 16 99 b'\x00\x00\x00\x98:B\x1a\xbev\x90\xb2\x00\x00\...
> 1 32 99 b'\x01\x08\x00\xbf:\x8b\x1a{r=\xb2\x88\x00\t\x...
> 2 48 99 b'\x02\x10\x00\xe7:\x9c\x1c\x1at:\xb3\x10\x01\...
> 3 64 99 b"\x03\x18\x00\x0f;\x16\x1bOt|\xb2\x98\x01\x19...
> 4 80 99 b'\x04 \x007;c\x1b\xddt~\xb2 \x02!\x00<;x\x1a\...
> .. ... ... ...
> 16413 262080 99 b'{\xd8\xff\x1d+\xe6\xc5H)r\xc1X\xfd\xd9\xff +...
> 16414 262096 99 b'|\xe0\xff6+g\xc5A,\x0c\xc3\xe0\xfd\xe1\xff9+...
> 16415 262112 99 b'}\xe8\xffN+\xd3\xc4")D\xc2h\xfe\xe9\xffQ+M\x...
> 16416 262128 99 b"~\xf0\xffg+=\xc5E';\xc2\xf0\xfe\xf1\xffj+\xf...
> 16417 262144 99 b"\x7f\xf8\xff\x81+\x13\xc4\xdd'\x15\xc2x\xff\...
>
> [16418 rows x 3 columns]
>
> Packets with tags >= 128:
> timetick length data
> 179 2864 36 b"\xca'Twas brillig, and the slithy toves\x00\...
> 307 4896 35 b'\xca Did gyre and gimble in the wabe:\x00\x...
> 340 5408 30 b'\xcaAll mimsy were the borogoves,\x00\x00\x0...
> 362 5744 31 b'\xca And the mome raths outgrabe.\x00\x00\x...
> 651 10352 1 b'\xca\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00...
> 1403 22368 32 b'\xca"Beware the Jabberwock, my son!\x00\x00\...
> 1741 27760 44 b'\xca The jaws that bite, the claws that cat...
> 2115 33728 33 b'\xcaBeware the Jubjub bird, and shun\x00\x00...
> 2162 34464 30 b'\xca The frumious Bandersnatch!"\x00\x00\x0...
> 2278 36304 1 b'\xca\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00...
> 2405 38320 34 b'\xcaHe took his vorpal sword in hand:\x00\x0...
> 2675 42624 41 b'\xca Long time the manxome foe he sought --...
> 2896 46144 33 b'\xcaSo rested he by the Tumtum tree,\x00\x00...
> 3611 57568 31 b'\xca And stood awhile in thought.\x00\x00\x...
> 4089 65200 1 b'\xca\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00...
> 5231 83456 36 b'\xcaAnd, as in uffish thought he stood,\x00\...
> 5236 83520 38 b'\xca The Jabberwock, with eyes of flame,\x0...
> 5427 86560 40 b'\xcaCame whiffling through the tulgey wood,\...
> 6904 110176 26 b'\xca And burbled as it came!\x00\x00\x00\x0...
> 7003 111744 1 b'\xca\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00...
> 7286 116256 44 b'\xcaOne, two! One, two! And through and thro...
> 8226 131280 39 b'\xca The vorpal blade went snicker-snack!\x...
> 8370 133568 35 b'\xcaHe left it dead, and with its head\x00\x...
> 8849 141216 27 b'\xca He went galumphing back.\x00\x00\x00\x...
> 10326 164832 1 b'\xca\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00...
> 11867 189472 37 b'\xca"And, has thou slain the Jabberwock?\x00...
> 12392 197856 35 b'\xca Come to my arms, my beamish boy!\x00\x...
> 12936 206544 34 b"\xcaO frabjous day! Callooh! Callay!'\x00\x0...
> 13794 220256 26 b'\xca He chortled in his joy.\x00\x00\x00\x0...
> 13905 222016 1 b'\xca\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00...
> 14690 234560 36 b"\xca'Twas brillig, and the slithy toves\x00\...
> 15317 244576 35 b'\xca Did gyre and gimble in the wabe;\x00\x...
> 15840 252928 30 b'\xcaAll mimsy were the borogoves,\x00\x00\x0...
> 16339 260896 31 b'\xca And the mome raths outgrabe.\x00\x00\x...
>
> (py3) C:\tmp\git\dv\test-h5-gist>ls -l test1.*
> -rw-rw-rw- 1 user group 908773 Nov 2 13:07 test1.h5
> -rw-rw-rw- 1 user group 1611025 Nov 2 13:35 test1.pq
>
> (py3) C:\tmp\git\dv\test-h5-gist>h5ls -v -r test1.h5
> Opened "test1.h5" with sec2 driver.
> / Group
> Attribute: CLASS scalar
> Type: 5-byte null-terminated UTF-8 string
> Data: "GROUP"
> Attribute: PYTABLES_FORMAT_VERSION scalar
> Type: 3-byte null-terminated UTF-8 string
> Data: "2.1"
> Attribute: TITLE null
> Type: 1-byte null-terminated UTF-8 string
>
> Attribute: VERSION scalar
> Type: 3-byte null-terminated UTF-8 string
> Data: "1.0"
> Location: 1:96
> Links: 1
> /data Group
> Attribute: CLASS scalar
> Type: 5-byte null-terminated UTF-8 string
> Data: "GROUP"
> Attribute: TITLE null
> Type: 1-byte null-terminated UTF-8 string
>
> Attribute: VERSION scalar
> Type: 3-byte null-terminated UTF-8 string
> Data: "1.0"
> Location: 1:1024
> Links: 1
> /data/packets Dataset {16418/Inf}
> Attribute: CLASS scalar
> Type: 5-byte null-terminated UTF-8 string
> Data: "TABLE"
> Attribute: FIELD_0_FILL scalar
> Type: native unsigned long long
> Data: 0
> Attribute: FIELD_0_NAME scalar
> Type: 8-byte null-terminated UTF-8 string
> Data: "timetick"
> Attribute: FIELD_1_FILL scalar
> Type: native unsigned short
> Data: 0
> Attribute: FIELD_1_NAME scalar
> Type: 6-byte null-terminated UTF-8 string
> Data: "length"
> Attribute: FIELD_2_FILL scalar
> Type: native unsigned char
> Data: 0
> Attribute: FIELD_2_NAME scalar
> Type: 4-byte null-terminated UTF-8 string
> Data: "data"
> Attribute: NROWS scalar
> Type: native long long
> Data: 16418
> Attribute: TITLE null
> Type: 1-byte null-terminated UTF-8 string
>
> Attribute: VERSION scalar
> Type: 3-byte null-terminated UTF-8 string
> Data: "2.7"
> Location: 1:2216
> Links: 1
> Chunks: {247} 65455 bytes
> Storage: 4350770 logical bytes, 899061 allocated bytes, 483.92%
> utilization
> Filter-0: shuffle-2 OPT {265}
> Filter-1: deflate-1 OPT {5}
> Type: struct {
> "timetick" +0 native unsigned long long
> "length" +8 native unsigned short
> "data" +10 [255] native unsigned char
> } 265 bytes
>
>