You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@arrow.apache.org by "lccnl (via GitHub)" <gi...@apache.org> on 2023/03/20 09:44:52 UTC

[GitHub] [arrow] lccnl opened a new issue, #34639: [Python] Info about offset is loss when converting struct array to record batch

lccnl opened a new issue, #34639:
URL: https://github.com/apache/arrow/issues/34639

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   Hello,
   it seems that when a struct is pointing to a part of a larger array via an offset and we try to convert it to a record batch, that information is lost and we get a record batch with columns having the larger array length.
   
   For now, a workaround is to select all the actual values in the array before converting it to a recordbatch (though this solution does not scale well, slicing does not work).
   
   The following code reproduces the error and shows the workaround: 
   ```import pyarrow as pa
   
   # create a struct and a table having as column the struct, then split it into record batches
   struct_1=pa.StructArray.from_arrays([pa.array([1.,2.]),pa.array(['a','b'])],names=['col1','col2'])
   out=pa.Table.from_arrays(arrays=[struct_1],names=['struct'])
   batches=out.to_batches(max_chunksize=1)
   
   #convert to  struct arrays, here each array has an offset referencing the table
   arrays=[pa.StructArray.from_arrays(batch.columns,names=batch.schema.names) for batch in batches]
   #select the struct
   modified_arrays=[array.field('struct') for array in arrays]
   
   # just take length of each array to show difference
   taken_arrays=[array.take(pa.array([0])) for array in modified_arrays]
   
   for standard,taken in zip(modified_arrays,taken_arrays):
       #arrays are equals
       assert standard==taken
       assert len(standard)==len(taken)==1
       #but record batches are different!
       assert len(pa.RecordBatch.from_struct_array(standard).column(0))==2
       assert len(pa.RecordBatch.from_struct_array(taken).column(0))==1 ```
   
   ### Component(s)
   
   Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] westonpace closed issue #34639: [Python] Info about offset is loss when converting struct array to record batch

Posted by "westonpace (via GitHub)" <gi...@apache.org>.

westonpace closed issue #34639: [Python] Info about offset is loss when converting struct array to record batch
URL: https://github.com/apache/arrow/issues/34639


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] westonpace commented on issue #34639: [Python] Info about offset is loss when converting struct array to record batch

Posted by "westonpace (via GitHub)" <gi...@apache.org>.

westonpace commented on issue #34639:
URL: https://github.com/apache/arrow/issues/34639#issuecomment-1536580746

   Alright, I've filed #35450 and #35452 and I've [asked on the ML](https://lists.apache.org/thread/6jtyf5xhfdocb2rlx1jfjwx0rj4hn6o1) about these kinds of arrays.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] lccnl commented on issue #34639: [Python] Info about offset is loss when converting struct array to record batch

Posted by "lccnl (via GitHub)" <gi...@apache.org>.

lccnl commented on issue #34639:
URL: https://github.com/apache/arrow/issues/34639#issuecomment-1533300802

   Hello,
   I tested the fix on pyarrow 12 and it seems that the bug is still there:
   
   ```
   assert(len(pa.RecordBatch.from_struct_array(standard).column(0)))==len(pa.RecordBatch.from_struct_array(taken).column(0))```
   
   yields an error 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] westonpace commented on issue #34639: [Python] Info about offset is loss when converting struct array to record batch

Posted by "westonpace (via GitHub)" <gi...@apache.org>.

westonpace commented on issue #34639:
URL: https://github.com/apache/arrow/issues/34639#issuecomment-1536510963

   I've reopened this so we can verify but I think it is actually doing the right thing.  Although I think there is another bug in to_struct_array and to_pandas (:face_exhaling:)
   
   ```
   > pa.RecordBatch.from_struct_array(standard)
   ```
   
   This will give you a record batch that has length 1 with two child arrays that each have length 2.  This is allowed because it lets us use zero-copy.
   
   ```
   >>> x = pa.RecordBatch.from_struct_array(standard)
   >>> print(x) # Sadly, we don't print the contents here
   pyarrow.RecordBatch
   col1: double
   col2: string
   >>> print(x.num_rows) # This is correct
   1
   >>> print(x.column(0)) # This is arguably correct but misleading
   [
     1,
     2
   ]
   >>> print(x.to_pylist()) # This is correct
   [{'col1': 1.0, 'col2': 'a'}]
   >>> print(x.to_struct_array()) # This is wrong
   -- is_valid: all not null
   -- child 0 type: double
     [
       1,
       2
     ]
   -- child 1 type: string
     [
       "a",
       "b"
     ]
   >>> print(x.to_pandas()) # this is also wrong
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "pyarrow/array.pxi", line 852, in pyarrow.lib._PandasConvertible.to_pandas
     File "pyarrow/table.pxi", line 2506, in pyarrow.lib.RecordBatch._to_pandas
     File "pyarrow/table.pxi", line 4075, in pyarrow.lib.Table._to_pandas
     File "/home/pace/dev/arrow/python/pyarrow/pandas_compat.py", line 823, in table_to_blockmanager
       return BlockManager(blocks, axes)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^
     File "/home/pace/miniconda3/envs/conbench3/lib/python3.11/site-packages/pandas/core/internals/managers.py", line 1040, in __init__
       self._verify_integrity()
     File "/home/pace/miniconda3/envs/conbench3/lib/python3.11/site-packages/pandas/core/internals/managers.py", line 1047, in _verify_integrity
       raise construction_error(tot_items, block.shape[1:], self.axes)
   ValueError: Shape of passed values is (2, 2), indices imply (1, 2)
   ```
   
   I will open up two new issues for to_struct_array and to_pandas.  Arguably, we should also modify `to_batches` to push "short lengths" into the arrays themselves.  I'll have to ask on the ML if it's legal for a record batch and its arrays to have different lengths.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] westonpace closed issue #34639: [Python] Info about offset is loss when converting struct array to record batch

Posted by "westonpace (via GitHub)" <gi...@apache.org>.

westonpace closed issue #34639: [Python] Info about offset is loss when converting struct array to record batch
URL: https://github.com/apache/arrow/issues/34639


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] westonpace commented on issue #34639: [Python] Info about offset is loss when converting struct array to record batch

Posted by "westonpace (via GitHub)" <gi...@apache.org>.

westonpace commented on issue #34639:
URL: https://github.com/apache/arrow/issues/34639#issuecomment-1539009340

   > One thing I have noticed is that if you do your same test on the second batch:
   
   Yes, since the second batch has a non-zero offset we follow a different code path that handles the non-matching lengths better.  I will close this back up.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] lccnl commented on issue #34639: [Python] Info about offset is loss when converting struct array to record batch

Posted by "lccnl (via GitHub)" <gi...@apache.org>.

lccnl commented on issue #34639:
URL: https://github.com/apache/arrow/issues/34639#issuecomment-1537135126

   ok thank you for the explanation and for opening specific issues!
   
   One thing I have noticed is that if you do your same test on the second batch: 
   ````
   x=pa.RecordBatch.from_struct_array(modified_arrays[1])
   ````
   everything works fine (methods to_pandas and to.struct_array yield the correct array with one row). 
   
   In the first case (taking the first batch), the method to_pydict also returns the wrong result (I guess it must be related to how to_struct_array works?).
   
   If you want to close this issue, I can follow the two you have opened.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org