You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@arrow.apache.org by Jason Sachs <jm...@gmail.com> on 2020/11/03 20:41:34 UTC

Best way to store ragged packet data in Parquet files

(reposted here; I had posted to dev@arrow by mistake)

Hi all--

I've been getting started with Parquet as a storage alternative to HDF5 and it has a lot of attractive quantities including compression flexibility efficiency.

But I'm stumped for storage efficiency in Parquet with one type of data that I have.

This is a large series of "ragged" packets arriving as a stream, where each packet consists of up to 255 bytes of binary data. The vast majority of the packets have lengths between 96 and 112 bytes. I need to store each of them with a 64-bit timestamp.

I can get a good storage efficiency with HDF5 with the following table schema using pytables:

class StoredPacket(pt.IsDescription):
    timetick = pt.UInt64Col(pos=0)
    length = pt.UInt16Col(pos=1)
    data = pt.UInt8Col(pos=2,shape=(255,))

This stores packet data as an array of uint8 with length 255. I zero-pad the packet to length 255 and store the length as well in a separate column.

I have created a sample file in a Github gist: https://gist.github.com/jason-sachs/aa6dbdaced806bb76bc7a347dfc303dc (see test1.h5) along with a Python script convert_test1.py that converts it to a Pandas DataFrame and stores it via Parquet. But the Parquet files are almost twice as large as the .h5 file no matter what storage technique I use; brotli is best but slow, and zstd is almost as good as brotli but much faster.

Any suggestions on how I might improve storage efficiency in Parquet? I have a lot of flexibility with how I can store the data; my only requirement is that I can retrieve the data packets quickly from the storage file. I offer this sample file as a test case.

(py3) C:\tmp\git\dv\test-h5-gist>python convert_test1.py
Table overview:
       timetick  length                                               data
0            16      99  b'\x00\x00\x00\x98:B\x1a\xbev\x90\xb2\x00\x00\...
1            32      99  b'\x01\x08\x00\xbf:\x8b\x1a{r=\xb2\x88\x00\t\x...
2            48      99  b'\x02\x10\x00\xe7:\x9c\x1c\x1at:\xb3\x10\x01\...
3            64      99  b"\x03\x18\x00\x0f;\x16\x1bOt|\xb2\x98\x01\x19...
4            80      99  b'\x04 \x007;c\x1b\xddt~\xb2 \x02!\x00<;x\x1a\...
..         ...     ...                                                ...
16413    262080      99  b'{\xd8\xff\x1d+\xe6\xc5H)r\xc1X\xfd\xd9\xff +...
16414    262096      99  b'|\xe0\xff6+g\xc5A,\x0c\xc3\xe0\xfd\xe1\xff9+...
16415    262112      99  b'}\xe8\xffN+\xd3\xc4")D\xc2h\xfe\xe9\xffQ+M\x...
16416    262128      99  b"~\xf0\xffg+=\xc5E';\xc2\xf0\xfe\xf1\xffj+\xf...
16417    262144      99  b"\x7f\xf8\xff\x81+\x13\xc4\xdd'\x15\xc2x\xff\...

[16418 rows x 3 columns]

Packets with tags >= 128:
       timetick  length                                               data
179        2864      36  b"\xca'Twas brillig, and the slithy toves\x00\...
307        4896      35  b'\xca  Did gyre and gimble in the wabe:\x00\x...
340        5408      30  b'\xcaAll mimsy were the borogoves,\x00\x00\x0...
362        5744      31  b'\xca  And the mome raths outgrabe.\x00\x00\x...
651       10352       1  b'\xca\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00...
1403      22368      32  b'\xca"Beware the Jabberwock, my son!\x00\x00\...
1741      27760      44  b'\xca  The jaws that bite, the claws that cat...
2115      33728      33  b'\xcaBeware the Jubjub bird, and shun\x00\x00...
2162      34464      30  b'\xca  The frumious Bandersnatch!"\x00\x00\x0...
2278      36304       1  b'\xca\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00...
2405      38320      34  b'\xcaHe took his vorpal sword in hand:\x00\x0...
2675      42624      41  b'\xca  Long time the manxome foe he sought --...
2896      46144      33  b'\xcaSo rested he by the Tumtum tree,\x00\x00...
3611      57568      31  b'\xca  And stood awhile in thought.\x00\x00\x...
4089      65200       1  b'\xca\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00...
5231      83456      36  b'\xcaAnd, as in uffish thought he stood,\x00\...
5236      83520      38  b'\xca  The Jabberwock, with eyes of flame,\x0...
5427      86560      40  b'\xcaCame whiffling through the tulgey wood,\...
6904     110176      26  b'\xca  And burbled as it came!\x00\x00\x00\x0...
7003     111744       1  b'\xca\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00...
7286     116256      44  b'\xcaOne, two! One, two! And through and thro...
8226     131280      39  b'\xca  The vorpal blade went snicker-snack!\x...
8370     133568      35  b'\xcaHe left it dead, and with its head\x00\x...
8849     141216      27  b'\xca  He went galumphing back.\x00\x00\x00\x...
10326    164832       1  b'\xca\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00...
11867    189472      37  b'\xca"And, has thou slain the Jabberwock?\x00...
12392    197856      35  b'\xca  Come to my arms, my beamish boy!\x00\x...
12936    206544      34  b"\xcaO frabjous day! Callooh! Callay!'\x00\x0...
13794    220256      26  b'\xca  He chortled in his joy.\x00\x00\x00\x0...
13905    222016       1  b'\xca\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00...
14690    234560      36  b"\xca'Twas brillig, and the slithy toves\x00\...
15317    244576      35  b'\xca  Did gyre and gimble in the wabe;\x00\x...
15840    252928      30  b'\xcaAll mimsy were the borogoves,\x00\x00\x0...
16339    260896      31  b'\xca  And the mome raths outgrabe.\x00\x00\x...

(py3) C:\tmp\git\dv\test-h5-gist>ls -l test1.*
-rw-rw-rw-   1 user     group      908773 Nov  2 13:07 test1.h5
-rw-rw-rw-   1 user     group     1611025 Nov  2 13:35 test1.pq

(py3) C:\tmp\git\dv\test-h5-gist>h5ls -v -r test1.h5
Opened "test1.h5" with sec2 driver.
/                        Group
    Attribute: CLASS scalar
        Type:      5-byte null-terminated UTF-8 string
        Data:  "GROUP"
    Attribute: PYTABLES_FORMAT_VERSION scalar
        Type:      3-byte null-terminated UTF-8 string
        Data:  "2.1"
    Attribute: TITLE null
        Type:      1-byte null-terminated UTF-8 string

    Attribute: VERSION scalar
        Type:      3-byte null-terminated UTF-8 string
        Data:  "1.0"
    Location:  1:96
    Links:     1
/data                    Group
    Attribute: CLASS scalar
        Type:      5-byte null-terminated UTF-8 string
        Data:  "GROUP"
    Attribute: TITLE null
        Type:      1-byte null-terminated UTF-8 string

    Attribute: VERSION scalar
        Type:      3-byte null-terminated UTF-8 string
        Data:  "1.0"
    Location:  1:1024
    Links:     1
/data/packets            Dataset {16418/Inf}
    Attribute: CLASS scalar
        Type:      5-byte null-terminated UTF-8 string
        Data:  "TABLE"
    Attribute: FIELD_0_FILL scalar
        Type:      native unsigned long long
        Data:  0
    Attribute: FIELD_0_NAME scalar
        Type:      8-byte null-terminated UTF-8 string
        Data:  "timetick"
    Attribute: FIELD_1_FILL scalar
        Type:      native unsigned short
        Data:  0
    Attribute: FIELD_1_NAME scalar
        Type:      6-byte null-terminated UTF-8 string
        Data:  "length"
    Attribute: FIELD_2_FILL scalar
        Type:      native unsigned char
        Data:  0
    Attribute: FIELD_2_NAME scalar
        Type:      4-byte null-terminated UTF-8 string
        Data:  "data"
    Attribute: NROWS scalar
        Type:      native long long
        Data:  16418
    Attribute: TITLE null
        Type:      1-byte null-terminated UTF-8 string

    Attribute: VERSION scalar
        Type:      3-byte null-terminated UTF-8 string
        Data:  "2.7"
    Location:  1:2216
    Links:     1
    Chunks:    {247} 65455 bytes
    Storage:   4350770 logical bytes, 899061 allocated bytes, 483.92% utilization
    Filter-0:  shuffle-2 OPT {265}
    Filter-1:  deflate-1 OPT {5}
    Type:      struct {
                   "timetick"         +0    native unsigned long long
                   "length"           +8    native unsigned short
                   "data"             +10   [255] native unsigned char
               } 265 bytes


Re: Best way to store ragged packet data in Parquet files

Posted by Micah Kornfield <em...@gmail.com>.
>
> For the test cases I have, >99% of the packets are the same length, so
> there's little-to-no benefit of removing the padding; the length field and
> zero padding barely adds anything once you factor compression into the mix.


Are you writing the data out as fixed size bytes arrays or as variable
length binary data?

On Tue, Nov 3, 2020 at 1:26 PM Jason Sachs <jm...@gmail.com> wrote:

>
>
> On 2020/11/03 20:49:46, Micah Kornfield <em...@gmail.com> wrote:
> > Hi Jason,
> > At least as a first pass I would try to avoid the padding and storing the
> > length separately in Parquet.  Using one column for timestamp and one
> > column of bytes for the data is what I would try first.  If there is any
> > structure to the packets splitting them into the structure could also
> help.
> >
> > -Micah
>
> For the test cases I have, >99% of the packets are the same length, so
> there's little-to-no benefit of removing the padding; the length field and
> zero padding barely adds anything once you factor compression into the mix.
>
> I've tried use_dictionaries=False and that does help some.
>
> But I'll post an updated example to back these statements up and see how
> much better I can get.
>
> I'm just surprised that hdf5 does a better job in this case; maybe I don't
> understand the constraints the file format imposes on data compression.
>

Re: Best way to store ragged packet data in Parquet files

Posted by Jason Sachs <jm...@gmail.com>.

On 2020/11/03 20:49:46, Micah Kornfield <em...@gmail.com> wrote: 
> Hi Jason,
> At least as a first pass I would try to avoid the padding and storing the
> length separately in Parquet.  Using one column for timestamp and one
> column of bytes for the data is what I would try first.  If there is any
> structure to the packets splitting them into the structure could also help.
> 
> -Micah

For the test cases I have, >99% of the packets are the same length, so there's little-to-no benefit of removing the padding; the length field and zero padding barely adds anything once you factor compression into the mix.

I've tried use_dictionaries=False and that does help some.

But I'll post an updated example to back these statements up and see how much better I can get.

I'm just surprised that hdf5 does a better job in this case; maybe I don't understand the constraints the file format imposes on data compression.

Re: Best way to store ragged packet data in Parquet files

Posted by Micah Kornfield <em...@gmail.com>.
Hi Jason,
At least as a first pass I would try to avoid the padding and storing the
length separately in Parquet.  Using one column for timestamp and one
column of bytes for the data is what I would try first.  If there is any
structure to the packets splitting them into the structure could also help.

-Micah

On Tue, Nov 3, 2020 at 12:41 PM Jason Sachs <jm...@gmail.com> wrote:

> (reposted here; I had posted to dev@arrow by mistake)
>
> Hi all--
>
> I've been getting started with Parquet as a storage alternative to HDF5
> and it has a lot of attractive quantities including compression flexibility
> efficiency.
>
> But I'm stumped for storage efficiency in Parquet with one type of data
> that I have.
>
> This is a large series of "ragged" packets arriving as a stream, where
> each packet consists of up to 255 bytes of binary data. The vast majority
> of the packets have lengths between 96 and 112 bytes. I need to store each
> of them with a 64-bit timestamp.
>
> I can get a good storage efficiency with HDF5 with the following table
> schema using pytables:
>
> class StoredPacket(pt.IsDescription):
>     timetick = pt.UInt64Col(pos=0)
>     length = pt.UInt16Col(pos=1)
>     data = pt.UInt8Col(pos=2,shape=(255,))
>
> This stores packet data as an array of uint8 with length 255. I zero-pad
> the packet to length 255 and store the length as well in a separate column.
>
> I have created a sample file in a Github gist:
> https://gist.github.com/jason-sachs/aa6dbdaced806bb76bc7a347dfc303dc (see
> test1.h5) along with a Python script convert_test1.py that converts it to a
> Pandas DataFrame and stores it via Parquet. But the Parquet files are
> almost twice as large as the .h5 file no matter what storage technique I
> use; brotli is best but slow, and zstd is almost as good as brotli but much
> faster.
>
> Any suggestions on how I might improve storage efficiency in Parquet? I
> have a lot of flexibility with how I can store the data; my only
> requirement is that I can retrieve the data packets quickly from the
> storage file. I offer this sample file as a test case.
>
> (py3) C:\tmp\git\dv\test-h5-gist>python convert_test1.py
> Table overview:
>        timetick  length                                               data
> 0            16      99  b'\x00\x00\x00\x98:B\x1a\xbev\x90\xb2\x00\x00\...
> 1            32      99  b'\x01\x08\x00\xbf:\x8b\x1a{r=\xb2\x88\x00\t\x...
> 2            48      99  b'\x02\x10\x00\xe7:\x9c\x1c\x1at:\xb3\x10\x01\...
> 3            64      99  b"\x03\x18\x00\x0f;\x16\x1bOt|\xb2\x98\x01\x19...
> 4            80      99  b'\x04 \x007;c\x1b\xddt~\xb2 \x02!\x00<;x\x1a\...
> ..         ...     ...                                                ...
> 16413    262080      99  b'{\xd8\xff\x1d+\xe6\xc5H)r\xc1X\xfd\xd9\xff +...
> 16414    262096      99  b'|\xe0\xff6+g\xc5A,\x0c\xc3\xe0\xfd\xe1\xff9+...
> 16415    262112      99  b'}\xe8\xffN+\xd3\xc4")D\xc2h\xfe\xe9\xffQ+M\x...
> 16416    262128      99  b"~\xf0\xffg+=\xc5E';\xc2\xf0\xfe\xf1\xffj+\xf...
> 16417    262144      99  b"\x7f\xf8\xff\x81+\x13\xc4\xdd'\x15\xc2x\xff\...
>
> [16418 rows x 3 columns]
>
> Packets with tags >= 128:
>        timetick  length                                               data
> 179        2864      36  b"\xca'Twas brillig, and the slithy toves\x00\...
> 307        4896      35  b'\xca  Did gyre and gimble in the wabe:\x00\x...
> 340        5408      30  b'\xcaAll mimsy were the borogoves,\x00\x00\x0...
> 362        5744      31  b'\xca  And the mome raths outgrabe.\x00\x00\x...
> 651       10352       1  b'\xca\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00...
> 1403      22368      32  b'\xca"Beware the Jabberwock, my son!\x00\x00\...
> 1741      27760      44  b'\xca  The jaws that bite, the claws that cat...
> 2115      33728      33  b'\xcaBeware the Jubjub bird, and shun\x00\x00...
> 2162      34464      30  b'\xca  The frumious Bandersnatch!"\x00\x00\x0...
> 2278      36304       1  b'\xca\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00...
> 2405      38320      34  b'\xcaHe took his vorpal sword in hand:\x00\x0...
> 2675      42624      41  b'\xca  Long time the manxome foe he sought --...
> 2896      46144      33  b'\xcaSo rested he by the Tumtum tree,\x00\x00...
> 3611      57568      31  b'\xca  And stood awhile in thought.\x00\x00\x...
> 4089      65200       1  b'\xca\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00...
> 5231      83456      36  b'\xcaAnd, as in uffish thought he stood,\x00\...
> 5236      83520      38  b'\xca  The Jabberwock, with eyes of flame,\x0...
> 5427      86560      40  b'\xcaCame whiffling through the tulgey wood,\...
> 6904     110176      26  b'\xca  And burbled as it came!\x00\x00\x00\x0...
> 7003     111744       1  b'\xca\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00...
> 7286     116256      44  b'\xcaOne, two! One, two! And through and thro...
> 8226     131280      39  b'\xca  The vorpal blade went snicker-snack!\x...
> 8370     133568      35  b'\xcaHe left it dead, and with its head\x00\x...
> 8849     141216      27  b'\xca  He went galumphing back.\x00\x00\x00\x...
> 10326    164832       1  b'\xca\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00...
> 11867    189472      37  b'\xca"And, has thou slain the Jabberwock?\x00...
> 12392    197856      35  b'\xca  Come to my arms, my beamish boy!\x00\x...
> 12936    206544      34  b"\xcaO frabjous day! Callooh! Callay!'\x00\x0...
> 13794    220256      26  b'\xca  He chortled in his joy.\x00\x00\x00\x0...
> 13905    222016       1  b'\xca\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00...
> 14690    234560      36  b"\xca'Twas brillig, and the slithy toves\x00\...
> 15317    244576      35  b'\xca  Did gyre and gimble in the wabe;\x00\x...
> 15840    252928      30  b'\xcaAll mimsy were the borogoves,\x00\x00\x0...
> 16339    260896      31  b'\xca  And the mome raths outgrabe.\x00\x00\x...
>
> (py3) C:\tmp\git\dv\test-h5-gist>ls -l test1.*
> -rw-rw-rw-   1 user     group      908773 Nov  2 13:07 test1.h5
> -rw-rw-rw-   1 user     group     1611025 Nov  2 13:35 test1.pq
>
> (py3) C:\tmp\git\dv\test-h5-gist>h5ls -v -r test1.h5
> Opened "test1.h5" with sec2 driver.
> /                        Group
>     Attribute: CLASS scalar
>         Type:      5-byte null-terminated UTF-8 string
>         Data:  "GROUP"
>     Attribute: PYTABLES_FORMAT_VERSION scalar
>         Type:      3-byte null-terminated UTF-8 string
>         Data:  "2.1"
>     Attribute: TITLE null
>         Type:      1-byte null-terminated UTF-8 string
>
>     Attribute: VERSION scalar
>         Type:      3-byte null-terminated UTF-8 string
>         Data:  "1.0"
>     Location:  1:96
>     Links:     1
> /data                    Group
>     Attribute: CLASS scalar
>         Type:      5-byte null-terminated UTF-8 string
>         Data:  "GROUP"
>     Attribute: TITLE null
>         Type:      1-byte null-terminated UTF-8 string
>
>     Attribute: VERSION scalar
>         Type:      3-byte null-terminated UTF-8 string
>         Data:  "1.0"
>     Location:  1:1024
>     Links:     1
> /data/packets            Dataset {16418/Inf}
>     Attribute: CLASS scalar
>         Type:      5-byte null-terminated UTF-8 string
>         Data:  "TABLE"
>     Attribute: FIELD_0_FILL scalar
>         Type:      native unsigned long long
>         Data:  0
>     Attribute: FIELD_0_NAME scalar
>         Type:      8-byte null-terminated UTF-8 string
>         Data:  "timetick"
>     Attribute: FIELD_1_FILL scalar
>         Type:      native unsigned short
>         Data:  0
>     Attribute: FIELD_1_NAME scalar
>         Type:      6-byte null-terminated UTF-8 string
>         Data:  "length"
>     Attribute: FIELD_2_FILL scalar
>         Type:      native unsigned char
>         Data:  0
>     Attribute: FIELD_2_NAME scalar
>         Type:      4-byte null-terminated UTF-8 string
>         Data:  "data"
>     Attribute: NROWS scalar
>         Type:      native long long
>         Data:  16418
>     Attribute: TITLE null
>         Type:      1-byte null-terminated UTF-8 string
>
>     Attribute: VERSION scalar
>         Type:      3-byte null-terminated UTF-8 string
>         Data:  "2.7"
>     Location:  1:2216
>     Links:     1
>     Chunks:    {247} 65455 bytes
>     Storage:   4350770 logical bytes, 899061 allocated bytes, 483.92%
> utilization
>     Filter-0:  shuffle-2 OPT {265}
>     Filter-1:  deflate-1 OPT {5}
>     Type:      struct {
>                    "timetick"         +0    native unsigned long long
>                    "length"           +8    native unsigned short
>                    "data"             +10   [255] native unsigned char
>                } 265 bytes
>
>