Posted to issues@arrow.apache.org by "alippai (via GitHub)" <gi...@apache.org> on 2023/03/09 05:13:23 UTC

[GitHub] [arrow] alippai opened a new issue, #34510: Reading FixedSizeList from parquet is slower than reading values into more rows

alippai opened a new issue, #34510:
URL: https://github.com/apache/arrow/issues/34510

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   I'm not 100% sure this is a bug, but I don't understand the difference between the two cases:
   Nested arrays:
   ```
   table with 30 columns, 2 level "index" (so 2 columns) and 29 FixedSizeList<double, 80> columns, 100k rows
   ```
   Exploded:
   ```
   table with 31 columns, 3 level "index" (so 3 columns) and 29 double columns, 8000k rows
   ```
   
   Reading the first table from Parquet (version 2.6, zstd compression) is surprisingly slower than reading the second table. I'd assume it's essentially the same task; a few columns are even shorter in the first case. The file sizes are almost equal.
   
   I used pyarrow 11 from conda and a local SSD.
   
   ### Component(s)
   
   Parquet




[GitHub] [arrow] alippai commented on issue #34510: Reading FixedSizeList from parquet is slower than reading values into more rows

alippai commented on issue #34510:
URL: https://github.com/apache/arrow/issues/34510#issuecomment-1464158098

   Looks like I have a lot to learn about repetition and definition levels, but it also looks like they can be RLE encoded, which means practically zero overhead when there are few nulls - in the best case it can be equal or close to the non-nullable path.
   
   I'm not a C++ coder, but summarizing the discussion above, there are 2-3 fast paths missing at different levels of the hierarchy:
   
   1. Calculating the def and rep levels for 100k rows with all non-null values takes 2x as long as reading 8M doubles (this is suspicious, but might be correct)
   2. Definition level data is all 0 and supposed to be RLE encoded, so we might want to skip expanding it (maybe decode it as non-nullable, since checking every value is not needed?)
   3. Repetition level data is a vector of `0` followed by 79x`1` per row in our case. I'm not sure RLE helps here; it sounds like an unnecessarily complex structure for fixed size lists. On the other hand, reading or decoding it could be skipped entirely, since it can be derived from the metadata (see the sketch below).
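   
   For illustration, a rough sketch (hypothetical, assuming no nulls anywhere) of how both level arrays for a `FixedSizeList<double, 80>` column could be generated from the metadata alone, without touching the encoded level data:
   ```python
   import numpy as np
   
   num_rows, list_size = 100_000, 80
   
   # Repetition levels are fully determined by the fixed list size:
   # 0 marks the first value of a row, 1 marks a continuation within the same list.
   rep_pattern = np.ones(list_size, dtype=np.int16)
   rep_pattern[0] = 0
   rep_levels = np.tile(rep_pattern, num_rows)      # 8M values, no decoding needed
   
   # With no nulls, definition levels collapse to one constant; its exact value
   # depends on how many repeated/optional ancestors the leaf has in the schema.
   max_def_level = 1                                # hypothetical: required list, required doubles
   def_levels = np.full(num_rows * list_size, max_def_level, dtype=np.int16)
   ```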




[GitHub] [arrow] mapleFU commented on issue #34510: Reading FixedSizeList from parquet is slower than reading values into more rows

mapleFU commented on issue #34510:
URL: https://github.com/apache/arrow/issues/34510#issuecomment-1463168911

   Thanks, I'll test this tonight. Currently my guess is that constructing the FixedSizeList may use some extra space and take some time.




[GitHub] [arrow] mapleFU commented on issue #34510: Reading FixedSizeList from parquet is slower than reading values into more rows

mapleFU commented on issue #34510:
URL: https://github.com/apache/arrow/issues/34510#issuecomment-1463548318

   In Parquet C++ we have "parquet-arrow-reader-writer-benchmark", and I found that `BM_ReadListColumn` is slower than `BM_ReadColumn<...,Int64Type>`. It's mainly because of the handling of List.
   
   During Parquet schema conversion, FixedSizeList is handled the same as List:
   
   ```
       case ArrowTypeId::FIXED_SIZE_LIST:
       case ArrowTypeId::LARGE_LIST:
       case ArrowTypeId::LIST: {
         auto list_type = std::static_pointer_cast<::arrow::BaseListType>(field->type());
         return ListToNode(list_type, name, field->nullable(), field_id, properties,
                           arrow_properties, out);
       }
   ```
   
   So it will probably allocate rep-levels for it when writing, and produce rep-levels again when decoding. Encoding / decoding rep-levels may take some time.
   
   For a non-null double column there is no List, so it isn't necessary to write rep-levels or def-levels at all.
   
   I guess there are some other reasons too, but I'm not familiar with the Python-to-C++ path.
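   
   A quick way to see this from Python (a minimal sketch with a toy list size; `fsl.parquet` is just a scratch file name):
   ```python
   import pyarrow as pa
   import pyarrow.parquet as pq
   
   # One FixedSizeList<double, 4> column, only to inspect the converted schema.
   values = pa.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
   fsl = pa.FixedSizeListArray.from_arrays(values, 4)
   pq.write_table(pa.table({'A': fsl}), 'fsl.parquet')
   
   # The printed schema shows the same 3-level repeated group used for ordinary lists,
   # so repetition/definition levels are written and have to be decoded on read.
   print(pq.ParquetFile('fsl.parquet').schema)
   ```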




[GitHub] [arrow] mapleFU commented on issue #34510: Reading FixedSizeList from parquet is slower than reading values into more rows

mapleFU commented on issue #34510:
URL: https://github.com/apache/arrow/issues/34510#issuecomment-1463884481

   > and it could be optimized to use a single static value, right?
   
   Yes, a developer may optimize this in the future. If the `FixedSizeArray` is non-nullable, Parquet can use a single static value, but if it is nullable, it cannot.
   
   > other reasons or replevels cascade somehow into even worse perf
   
   I've profiled the C++ part on my macOS machine with a release build (O2):
   
   <img width="1305" alt="C811842B-4DDA-423A-B8A1-EC6A7E4ADE33" src="https://user-images.githubusercontent.com/24351052/224341042-8d2cb020-2aa6-41f4-8c66-a913436ea1c2.png">
   
   1. Decoding doubles is fast
   2. Decoding levels takes nearly the same time as decoding doubles
   3. Constructing the List costs a little time
   
   The benchmark uses List rather than FixedSizeList, but I think the behavior is similar.
   
   I'm not so familiar with the Python part; maybe someone can profile that path.




[GitHub] [arrow] alippai commented on issue #34510: Reading FixedSizeList from parquet is slower than reading values into more rows

alippai commented on issue #34510:
URL: https://github.com/apache/arrow/issues/34510#issuecomment-1464331411

   Thanks, this is an amazing explanation. 
   
   The other day I saw a discussion about a Tensor canonical extension type on the Arrow mailing list, which builds on top of FixedSizeList. It looks like that means only partial Parquet support; in most cases it'll be cheaper to simply denormalize the data.
   
   A slightly faster alternative is using fixed size binary data (still slower than plain doubles, but much better than `FixedSizeList<double>`).
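   
   A minimal sketch of that variant (hypothetical, using the same 100k rows of 80 doubles; exact timings depend on hardware):
   ```python
   import numpy as np
   import pyarrow as pa
   import pyarrow.parquet as pq
   
   arr_random = np.random.default_rng().standard_normal(size=[8_000_000], dtype='float64')
   
   # View each row's 80 doubles as one 640-byte fixed size binary value (zero-copy over the buffer).
   fsb = pa.Array.from_buffers(pa.binary(80 * 8), 100_000, [None, pa.py_buffer(arr_random)])
   
   pq.write_table(pa.table({'A': fsb}), 't4.parquet')   # stored as a fixed-width binary column
   t4 = pq.read_table('t4.parquet')                     # no list levels to expand on read
   ```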




[GitHub] [arrow] mapleFU commented on issue #34510: Reading FixedSizeList from parquet is slower than reading values into more rows

mapleFU commented on issue #34510:
URL: https://github.com/apache/arrow/issues/34510#issuecomment-1464950991

   Learned a lot, thanks!




[GitHub] [arrow] alippai commented on issue #34510: Reading FixedSizeList from parquet is slower than reading values into more rows

alippai commented on issue #34510:
URL: https://github.com/apache/arrow/issues/34510#issuecomment-1462753928

   The same happens with non-null values (I'm not sure how to define the non-null list correctly, but it looks like it doesn't matter):
   ```python
   import numpy as np
   import pyarrow as pa
   import pyarrow.parquet as pq
   
   arr_random = np.random.default_rng().standard_normal(size=[8000000], dtype='float64')
   arr1 = pa.array(arr_random)
   arr2 = pa.FixedSizeListArray.from_arrays(arr_random, 80)
   t1 = pa.Table.from_arrays([arr1], schema=pa.schema([('A', pa.float64(), False)]))
   t2 = pa.Table.from_arrays([arr2], schema=pa.schema([('A', pa.list_(pa.field('A', pa.float64(), False), 80), False)]))
   t3 = pa.Table.from_arrays([arr2], schema=pa.schema([pa.field('A', pa.list_(pa.float64(), 80), False)]))
   
   pq.write_table(t1, 't1.parquet')
   pq.write_table(t2, 't2.parquet')
   pq.write_table(t3, 't3.parquet')
   ```
   `%%timeit`
   ```python
   t1 = pq.read_table('t1.parquet') # 30ms
   ```
   `%%timeit`
   ```python
   t2 = pq.read_table('t2.parquet') # 100ms
   ```
   `%%timeit`
   ```python
   t3 = pq.read_table('t3.parquet') # 100ms
   ```
   ```python
   print(t1.get_total_buffer_size(), t2.get_total_buffer_size(), t3.get_total_buffer_size()) # (64000000, 64000000, 64000000)
   print(t1.schema, t2.schema, t3.schema)
   # (A: double not null,
   # A: fixed_size_list<A: double not null>[80] not null
   #   child 0, A: double not null,
   # A: fixed_size_list<item: double>[80] not null
   #   child 0, item: double)
   ```
   




[GitHub] [arrow] alippai closed issue #34510: Reading FixedSizeList from parquet is slower than reading values into more rows

alippai closed issue #34510: Reading FixedSizeList from parquet is slower than reading values into more rows
URL: https://github.com/apache/arrow/issues/34510




[GitHub] [arrow] mapleFU commented on issue #34510: Reading FixedSizeList from parquet is slower than reading values into more rows

mapleFU commented on issue #34510:
URL: https://github.com/apache/arrow/issues/34510#issuecomment-1464833898

   Maybe you can use fixed size binary and have a transmute rule for it, like `FixedLengthBinary(sizeof(double) * 80)`, and use zero-copy here. But it's still tricky...
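   
   For instance, a rough sketch of such a transmute in Python (assuming no nulls and a zero offset):
   ```python
   import numpy as np
   import pyarrow as pa
   
   values = pa.array(np.random.default_rng().standard_normal(800_000))
   fsl = pa.FixedSizeListArray.from_arrays(values, 80)            # fixed_size_list<double>[80]
   
   # FixedSizeList<double, 80> -> fixed_size_binary(640): reuse the flattened value buffer.
   data_buf = fsl.values.buffers()[1]
   fsb = pa.Array.from_buffers(pa.binary(80 * 8), len(fsl), [None, data_buf])
   
   # And back: reinterpret the binary data buffer as doubles, then re-wrap as a list.
   doubles = pa.Array.from_buffers(pa.float64(), len(fsb) * 80, [None, fsb.buffers()[1]])
   fsl_back = pa.FixedSizeListArray.from_arrays(doubles, 80)
   ```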
   
   > or even some of the other primitive encodings such as the deeply flawed DELTA_BINARY_PACKED
   
   Hi @tustvold. To be honest, I wonder why `DELTA_BINARY_PACKED` is deeply flawed - is there any reference or comment on that?




[GitHub] [arrow] tustvold commented on issue #34510: Reading FixedSizeList from parquet is slower than reading values into more rows

tustvold commented on issue #34510:
URL: https://github.com/apache/arrow/issues/34510#issuecomment-1464387983

   > It looks like it means partial parquet support only, in most cases it'll be cheaper to simply denormalize the data.
   
   If denormalizing is an option, it will definitely be faster, at least until parquet adds native support for fixed size repeated elements (relatively unlikely - lots of readers don't even support v2 which is a decade old now). However, I believe the use-case for tensor types is to serialize other columns alongside, at which point denormalizing may not be possible.
   
   That being said, whilst FixedSizeList may be slow compared to native primitives, it should still be pretty fast compared to decoding byte arrays, or even some of the other primitive encodings such as the deeply flawed DELTA_BINARY_PACKED. Certainly compared to other commonly used serialization formats for tensors, such as protobuf or JSON, it will be significantly faster.
   
   To be honest parquet's tag line could be "It's good enough". You can almost certainly do 2-3x better than parquet for any given workload, but you really need orders of magnitude improvements to overcome ecosystem inertia. I suspect most workloads will also mix in byte arrays and/or object storage or block compression, at which point those will easily be the tall pole in decode performance. 




[GitHub] [arrow] tustvold commented on issue #34510: Reading FixedSizeList from parquet is slower than reading values into more rows

tustvold commented on issue #34510:
URL: https://github.com/apache/arrow/issues/34510#issuecomment-1464215953

   > @alamb @tustvold I saw your blog post about this for arrow-rs. Do you handle this differently in Rust?
   
   We don't support FixedSizeList in arrow-rs AFAIK. Parquet to my knowledge does not have an equivalent [logical](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md) construct, and so it isn't particularly clear to me what support would mean other than implicitly casting between a regular list and a fixed size list.
   
   > Calculating the def and rep levels for 100k rows with all non-null values takes 2x time as reading 8M doubles
   
   Assuming the doubles are PLAIN encoded this is not surprising: you are comparing the performance of what is effectively a `memcpy` that will run at memory bandwidth against a fairly complex [bit-packing](https://github.com/apache/parquet-format/blob/master/Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3) scheme used for the definition and repetition levels.
   
   In the Rust implementation we have a couple of tricks that help here, but it is still relatively expensive (at least compared to primitive decoding):
   
   * We decode definition levels directly to the null buffer if there are only nulls at the leaf level (i.e. no lists or nested nulls), allowing us to preserve the bit-packing
   * We have vectorised unpack implementations specialised for each bit width (I believe arrow C++ does this also)
   
   > Definition level data is all 1 and supposed to be RLE encoded
   
   It will actually all be 2, unless the doubles are themselves not nullable
   
   > Repetition level data is a vector of 0 followed by 79x1 repeated 100k times for our case. I'm not sure if RLE will help here, sounds like an unnecessary complex structure for fixed size lists
   
   These repetition levels will be [RLE encoded](https://github.com/apache/parquet-format/blob/master/Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3). Theoretically a reader could preserve this, but the record shredding logic is extremely fiddly and so might run the risk of adding complexity to an already very complex piece of code. At least in arrow-rs we decode to an array of `i16`
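   
   For intuition, a simplified sketch of a single RLE run from that hybrid encoding (the header is a ULEB128 varint whose low bit selects RLE vs bit-packed; the repeated value takes ceil(bit_width / 8) bytes), showing why constant levels are nearly free to store even though readers still expand them:
   ```python
   def uleb128(n: int) -> bytes:
       # Unsigned LEB128 varint, as used by the RLE / bit-packing hybrid run headers.
       out = bytearray()
       while True:
           byte = n & 0x7F
           n >>= 7
           out.append(byte | (0x80 if n else 0))
           if not n:
               return bytes(out)
   
   def rle_run(value: int, count: int, bit_width: int) -> bytes:
       # One RLE run: header = count << 1 (low bit 0 means RLE), then the value
       # stored little-endian in ceil(bit_width / 8) bytes.
       return uleb128(count << 1) + value.to_bytes(max(1, (bit_width + 7) // 8), 'little')
   
   # 8M definition levels that are all 2 collapse to a handful of bytes:
   print(len(rle_run(2, 8_000_000, 2)))   # 5
   ```
   The 0,1,1,... repetition levels do not form a single run, so they end up as a mix of runs (or bit-packed groups), which is part of why they are more expensive to decode.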




[GitHub] [arrow] alippai commented on issue #34510: Reading FixedSizeList from parquet is slower than reading values into more rows

alippai commented on issue #34510:
URL: https://github.com/apache/arrow/issues/34510#issuecomment-1462453540

   Everything was nullable; I'll check with non-null values and provide a minimal example tomorrow.




[GitHub] [arrow] alippai commented on issue #34510: Reading FixedSizeList from parquet is slower than reading values into more rows

alippai commented on issue #34510:
URL: https://github.com/apache/arrow/issues/34510#issuecomment-1464174659

   @alamb @tustvold  I saw your post about this for arrow-rs. Do you handle this differently in Rust?




[GitHub] [arrow] mapleFU commented on issue #34510: Reading FixedSizeList from parquet is slower than reading values into more rows

mapleFU commented on issue #34510:
URL: https://github.com/apache/arrow/issues/34510#issuecomment-1462419763

   Can you provide some sample code to reproduce the problem here? And are the FixedSizeList and the double within it nullable?




[GitHub] [arrow] alippai commented on issue #34510: Reading FixedSizeList from parquet is slower than reading values into more rows

alippai commented on issue #34510:
URL: https://github.com/apache/arrow/issues/34510#issuecomment-1463868182

   2x allocation / compute time for rep/def levels compared to the array values sounds excessive as well (and it could be optimized to use a single static value, right?)
   
   I agree that what you've found is the main reason, and I also share your suspicion that it's still too slow - there might be other reasons, or the rep levels cascade somehow into even worse perf :)




[GitHub] [arrow] tustvold commented on issue #34510: Reading FixedSizeList from parquet is slower than reading values into more rows

tustvold commented on issue #34510:
URL: https://github.com/apache/arrow/issues/34510#issuecomment-1464864258

   > why DELTA_BINARY_PACKED is deeply flawed
   
   The paper they link to actually explains why the approach is problematic - http://arxiv.org/pdf/1209.2137v5.pdf. The whole paper is about why not to implement delta compression in this way 😂




[GitHub] [arrow] alippai commented on issue #34510: Reading FixedSizeList from parquet is slower than reading values into more rows

alippai commented on issue #34510:
URL: https://github.com/apache/arrow/issues/34510#issuecomment-1464463384

   Agreed; also I'm closing this issue as in this form it's not really actionable.
   Arrow-based fixed size lists of primitive values (e.g. tensors) shouldn't be converted to nested Parquet data; they are better stored as BYTE_ARRAY in Parquet (while I think it'd be important, sadly there is no fixed size BYTE_ARRAY in the Parquet spec, so it will still be slightly slower than it could be). Some fast paths for never-null data - data that was not marked as non-nullable when it was saved - could be useful too, but that's all.

