You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@arrow.apache.org by jonathan mercier <jo...@cnrgh.fr> on 2020/03/28 17:12:49 UTC

Is it more efficient to store string into fixed binary size ?

Dear,

I continue my learning with arrow. 

1/ I would like to know if it is more efficient to store string as fxed
binary size ?

example:

---------------------------------------------------------------------
from pyarrow import Schema, Table, binary, schema, array
def to_binaries( s: str, size: int = 50) -> bytes: 
     nullchar = size - len(s) 
     if nullchar < 0: 
         raise Exception(f'String has more than {size} character:
{s}') 
     b = s.encode('ascii') + b'\0' * nullchar 
     return b 


fields = [('ID', binary(50))]
sc = schema(fields)
d = [ 'test', 'ab', 'bc', 'cd' ]
db = array([ to_binaries(x) for x in d ], type=binary(50))
t = Table.from_arrays(arrays=[db], schema=sc)

---------------------------------------------------------------------

2/ I misunderstood the part of Writing and Reading Streams
     \_ 
https://arrow.apache.org/docs/python/ipc.html#writing-and-reading-streams

I we use the provided writer, the final file format it is parquet file
?

Thanks

best regards

Re: Is it more efficient to store string into fixed binary size ?

Posted by Wes McKinney <we...@gmail.com>.

hi,

On Sat, Mar 28, 2020 at 12:13 PM jonathan mercier
<jo...@cnrgh.fr> wrote:
>
> Dear,
>
> I continue my learning with arrow.
>
> 1/ I would like to know if it is more efficient to store string as fxed
> binary size ?
>
> example:
>
> ---------------------------------------------------------------------
> from pyarrow import Schema, Table, binary, schema, array
> def to_binaries( s: str, size: int = 50) -> bytes:
>      nullchar = size - len(s)
>      if nullchar < 0:
>          raise Exception(f'String has more than {size} character:
> {s}')
>      b = s.encode('ascii') + b'\0' * nullchar
>      return b

It depends. The standard variable string type has 17 bits / 4.125
bytes of overhead per value. Fixed size strings have no overhead but
used a fixed amount of space. There is also no notion of nul
terminator, though an application could use nul terminators to embed
smaller strings in fixed size types.

>
> fields = [('ID', binary(50))]
> sc = schema(fields)
> d = [ 'test', 'ab', 'bc', 'cd' ]
> db = array([ to_binaries(x) for x in d ], type=binary(50))
> t = Table.from_arrays(arrays=[db], schema=sc)
>
> ---------------------------------------------------------------------
>
> 2/ I misunderstood the part of Writing and Reading Streams
>      \_
> https://arrow.apache.org/docs/python/ipc.html#writing-and-reading-streams
>
> I we use the provided writer, the final file format it is parquet file
> ?

Nope. This is the Arrow binary protocol defined in Columnar.rst. If
you want to create Parquet files you need to use the functions in
pyarrow.parquet

>
> Thanks
>
> best regards
>