You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/12/20 10:38:45 UTC

[GitHub] [arrow] vibhatha opened a new pull request #11993: ARROW-15153: [Python] Expose ReferencedBufferSize to python [WIP]

vibhatha opened a new pull request #11993:
URL: https://github.com/apache/arrow/pull/11993


   ## Notes on PR 
   
   In this PR, I am working on exposing the ReferencedBufferSize as described here: https://issues.apache.org/jira/browse/ARROW-15065
   
   In this working branch, I am addressing the ARROW-15135 issue (Python bindings) 
   
   Currently this is *work in progress*, but created a draft PR after applying the expected functionality on Array interface where the buffer size is exposed using a new function. The idea is to differentiate the `nbytes` from the actual amount of bytes allocated for data. 
   
   ### TODO:
   
   - [ ] Figure out best location to place the bindings
   - [ ] Bindings for
          - [ ] Table
          - [ ] ArrayData
          - [ ] ChunkedArray
          - [ ] RecordBatch


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] vibhatha commented on a change in pull request #11993: ARROW-15153: [Python] Expose ReferencedBufferSize to python [WIP]

Posted by GitBox <gi...@apache.org>.

vibhatha commented on a change in pull request #11993:
URL: https://github.com/apache/arrow/pull/11993#discussion_r774624436



##########
File path: python/pyarrow/array.pxi
##########
@@ -988,12 +988,64 @@ cdef class Array(_PandasConvertible):
         """
         Total number of bytes consumed by the elements of the array.
         """
+        return self.get_refererenced_buffer_size()
+
+    def get_total_size_by_buffer(self):

Review comment:
       I added a discussion in the ticket while going through the code
   
   https://issues.apache.org/jira/browse/ARROW-15065




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] westonpace commented on a change in pull request #11993: ARROW-15153: [Python] Expose ReferencedBufferSize to python [WIP]

Posted by GitBox <gi...@apache.org>.

westonpace commented on a change in pull request #11993:
URL: https://github.com/apache/arrow/pull/11993#discussion_r772568484



##########
File path: python/pyarrow/array.pxi
##########
@@ -994,6 +994,27 @@ cdef class Array(_PandasConvertible):
                 size += buf.size
         return size
 
+    def get_refererenced_buffer_size(self):
+        cdef:
+            shared_ptr[CArray] shd_ptr_c_array
+            CArray *c_array
+            CResult[int64_t] c_res_buffer
+        
+        shd_ptr_c_array = pyarrow_unwrap_array(self)
+        c_array = shd_ptr_c_array.get()
+        c_res_buffer = ReferencedBufferSize(c_array[0])
+        size = GetResultValue(c_res_buffer)
+        return size
+
+    @property
+    def get_buffer_size(self):

Review comment:
       I believe the goal was for `nbytes` to use `get_referenced_buffer_size` and for the old `nbytes` behavior to become a new method.
   
   Note: this was not what was said in the JIRA, but based on comments, I think it is what we wanted.  I have updated the description in the JIRA to be more clear on this.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] vibhatha commented on a change in pull request #11993: ARROW-15153: [Python] Expose ReferencedBufferSize to python [WIP]

Posted by GitBox <gi...@apache.org>.

vibhatha commented on a change in pull request #11993:
URL: https://github.com/apache/arrow/pull/11993#discussion_r774623561



##########
File path: python/pyarrow/array.pxi
##########
@@ -988,12 +988,64 @@ cdef class Array(_PandasConvertible):
         """
         Total number of bytes consumed by the elements of the array.
         """
+        return self.get_refererenced_buffer_size()
+
+    def get_total_size_by_buffer(self):

Review comment:
       Good catch. 
   
   This is a question I also have. Here the Python method counts the elements as same as it is done in the C++ method `TotalBufferSize`. For now I kept it as this is still WIP. I also felt like we must remove the calculation done in Python and just use the C++ method. Am I following this right? 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] vibhatha commented on pull request #11993: ARROW-15153: [Python] Expose ReferencedBufferSize to python

Posted by GitBox <gi...@apache.org>.

vibhatha commented on pull request #11993:
URL: https://github.com/apache/arrow/pull/11993#issuecomment-1015053038


   @jorisvandenbossche I updated the corresponding docs for table, chunked-array and recordbatch along with suggested changes for array doc strings. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] westonpace closed pull request #11993: ARROW-15153: [Python] Expose ReferencedBufferSize to python

Posted by GitBox <gi...@apache.org>.

westonpace closed pull request #11993:
URL: https://github.com/apache/arrow/pull/11993


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] vibhatha commented on a change in pull request #11993: ARROW-15153: [Python] Expose ReferencedBufferSize to python

Posted by GitBox <gi...@apache.org>.

vibhatha commented on a change in pull request #11993:
URL: https://github.com/apache/arrow/pull/11993#discussion_r786415586



##########
File path: python/pyarrow/tests/test_array.py
##########
@@ -2460,18 +2460,24 @@ def test_buffers_nested():
     assert struct.unpack('4xh', values) == (43,)
 
 
-def test_nbytes_sizeof():
+def test_total_buffer_size():

Review comment:
       added a test case. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jorisvandenbossche commented on pull request #11993: ARROW-15153: [Python] Expose ReferencedBufferSize to python

Posted by GitBox <gi...@apache.org>.

jorisvandenbossche commented on pull request #11993:
URL: https://github.com/apache/arrow/pull/11993#issuecomment-1015264187


   You can ignore the python doc failure, that was failing on master as well (I actually just merged a fix for that, so if you rebase that should be resolved)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #11993: ARROW-15153: [Python] Expose ReferencedBufferSize to python

Posted by GitBox <gi...@apache.org>.

jorisvandenbossche commented on a change in pull request #11993:
URL: https://github.com/apache/arrow/pull/11993#discussion_r786528062



##########
File path: python/pyarrow/array.pxi
##########
@@ -988,13 +988,51 @@ cdef class Array(_PandasConvertible):
     def nbytes(self):
         """
         Total number of bytes consumed by the elements of the array.
+
+        In other words, the sum of bytes from all buffer 
+        ranges referenced.
+
+        Unlike `get_total_buffer_size` this method will account for array
+        offsets.
+
+        If buffers are shared between arrays then the shared
+        portion will be counted multiple times.
+
+        The dictionary of dictionary arrays will always be counted in their 
+        entirety even if the array only references a portion of the dictionary.
         """
-        size = 0
-        for buf in self.buffers():
-            if buf is not None:
-                size += buf.size
+        cdef:
+            shared_ptr[CArray] shd_ptr_c_array
+            CArray *c_array
+            CResult[int64_t] c_res_buffer
+
+        shd_ptr_c_array = pyarrow_unwrap_array(self)
+        c_array = shd_ptr_c_array.get()
+        c_res_buffer = ReferencedBufferSize(deref(c_array))
+        size = GetResultValue(c_res_buffer)
         return size
 
+    def get_total_buffer_size(self):
+        """
+        The sum of bytes in each buffer referenced by the array.
+
+        An array may only reference a portion of a buffer.
+        This method will overestimate in this case and return the
+        byte size of the entire buffer.
+
+        If a buffer is referenced multiple times then it will
+        only be counted once.
+        """
+        cdef:
+            shared_ptr[CArray] shd_ptr_c_array
+            CArray *c_array
+            int64_t total_buffer_size
+
+        shd_ptr_c_array = pyarrow_unwrap_array(self)
+        c_array = shd_ptr_c_array.get()
+        total_buffer_size = TotalBufferSize(c_array[0])

Review comment:
       ```suggestion
           total_buffer_size = TotalBufferSize(deref(c_array))
   ```
   
   (like above?)

##########
File path: python/pyarrow/array.pxi
##########
@@ -988,13 +988,51 @@ cdef class Array(_PandasConvertible):
     def nbytes(self):
         """
         Total number of bytes consumed by the elements of the array.
+
+        In other words, the sum of bytes from all buffer 
+        ranges referenced.
+
+        Unlike `get_total_buffer_size` this method will account for array
+        offsets.
+
+        If buffers are shared between arrays then the shared
+        portion will be counted multiple times.
+
+        The dictionary of dictionary arrays will always be counted in their 
+        entirety even if the array only references a portion of the dictionary.
         """
-        size = 0
-        for buf in self.buffers():
-            if buf is not None:
-                size += buf.size
+        cdef:
+            shared_ptr[CArray] shd_ptr_c_array
+            CArray *c_array
+            CResult[int64_t] c_res_buffer
+
+        shd_ptr_c_array = pyarrow_unwrap_array(self)
+        c_array = shd_ptr_c_array.get()

Review comment:
       So I _think_ you can simplify this to `c_res_buffer = ReferencedBufferSize(deref(self.ap))`

##########
File path: python/pyarrow/tests/test_array.py
##########
@@ -2469,16 +2469,30 @@ def test_buffers_nested():
     assert struct.unpack('4xh', values) == (43,)
 
 
-def test_nbytes_sizeof():
+def test_total_buffer_size():
     a = pa.array(np.array([4, 5, 6], dtype='int64'))
-    assert a.nbytes == 8 * 3
+    assert a.get_total_buffer_size() == 8 * 3

Review comment:
       Can you keep both `assert a.nbytes == ..` and `assert a.get_total_buffer_size() == ..` here (and same for the ones below)

##########
File path: python/pyarrow/array.pxi
##########
@@ -988,13 +988,51 @@ cdef class Array(_PandasConvertible):
     def nbytes(self):
         """
         Total number of bytes consumed by the elements of the array.
+
+        In other words, the sum of bytes from all buffer 
+        ranges referenced.
+
+        Unlike `get_total_buffer_size` this method will account for array
+        offsets.
+
+        If buffers are shared between arrays then the shared
+        portion will be counted multiple times.
+
+        The dictionary of dictionary arrays will always be counted in their 
+        entirety even if the array only references a portion of the dictionary.
         """
-        size = 0
-        for buf in self.buffers():
-            if buf is not None:
-                size += buf.size
+        cdef:
+            shared_ptr[CArray] shd_ptr_c_array
+            CArray *c_array
+            CResult[int64_t] c_res_buffer
+
+        shd_ptr_c_array = pyarrow_unwrap_array(self)
+        c_array = shd_ptr_c_array.get()

Review comment:
       To get the C++ pointer, you can actually use `self.sp_array` or `self.ap` attributes, instead of calling `pyarrow_unwrap_array` on `self` (that method is only used for eg variables passed to a method)

##########
File path: python/pyarrow/table.pxi
##########
@@ -145,12 +145,50 @@ cdef class ChunkedArray(_PandasConvertible):
     def nbytes(self):
         """
         Total number of bytes consumed by the elements of the chunked array.
+
+        In other words, the sum of bytes from all buffer ranges referenced.
+
+        Unlike `get_total_buffer_size` this method will account for array
+        offsets.
+
+        If buffers are shared between arrays then the shared
+        portion will only be counted multiple times.
+
+        The dictionary of dictionary arrays will always be counted in their 
+        entirety even if the array only references a portion of the dictionary.
         """
-        size = 0
-        for chunk in self.iterchunks():
-            size += chunk.nbytes
+        cdef:
+            shared_ptr[CChunkedArray] shd_ptr_c_array
+            CChunkedArray *c_array
+            CResult[int64_t] c_res_buffer
+
+        shd_ptr_c_array = pyarrow_unwrap_chunked_array(self)
+        c_array = shd_ptr_c_array.get()
+        c_res_buffer = ReferencedBufferSize(deref(c_array))

Review comment:
       Similar comment here about that this can be simplified (no `pyarrow_unwrap_chunked_array(self)` needed)), just for each class the exact attribute name to use might be different.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] westonpace commented on a change in pull request #11993: ARROW-15153: [Python] Expose ReferencedBufferSize to python

Posted by GitBox <gi...@apache.org>.

westonpace commented on a change in pull request #11993:
URL: https://github.com/apache/arrow/pull/11993#discussion_r778487164



##########
File path: python/pyarrow/array.pxi
##########
@@ -988,12 +988,53 @@ cdef class Array(_PandasConvertible):
         """
         Total number of bytes consumed by the elements of the array.
         """
-        size = 0
-        for buf in self.buffers():
-            if buf is not None:
-                size += buf.size
+        return self.get_referenced_buffer_size()
+
+    def get_referenced_buffer_size(self):

Review comment:
       Hmm, from the discussion I thought we were just going to name this `get_buffer_size` since `get_referenced_buffer_size` was a  little confusing.
   
   Actually, since `nbytes` is an exact proxy/forward of `get_referenced_buffer_size` can we get rid of the method entirely and just have the property (moving the method's implementation into the property)

##########
File path: python/pyarrow/tests/test_array.py
##########
@@ -2460,18 +2460,24 @@ def test_buffers_nested():
     assert struct.unpack('4xh', values) == (43,)
 
 
-def test_nbytes_sizeof():
+def test_total_buffer_size():

Review comment:
       Can we add just one test where the array is sliced so that `nbytes` and `get_total_buffer_size` return different values?

##########
File path: python/pyarrow/array.pxi
##########
@@ -988,12 +988,53 @@ cdef class Array(_PandasConvertible):
         """
         Total number of bytes consumed by the elements of the array.
         """
-        size = 0
-        for buf in self.buffers():
-            if buf is not None:
-                size += buf.size
+        return self.get_referenced_buffer_size()
+
+    def get_referenced_buffer_size(self):
+        """
+        Returns the sum of bytes from all buffer ranges referenced
+
+        Unlike TotalBufferSize this method will account for array
+        offsets.
+
+        If buffers are shared between arrays then the shared
+        portion will only be counted multiple times.

Review comment:
       ```suggestion
           If buffers are shared between arrays then the shared
           portion will be counted multiple times.
   ```
   
   Looks like this typo is in the C++ description too if you want to fix it there.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #11993: ARROW-15153: [Python] Expose ReferencedBufferSize to python [WIP]

Posted by GitBox <gi...@apache.org>.

jorisvandenbossche commented on a change in pull request #11993:
URL: https://github.com/apache/arrow/pull/11993#discussion_r774615869



##########
File path: python/pyarrow/array.pxi
##########
@@ -988,12 +988,64 @@ cdef class Array(_PandasConvertible):
         """
         Total number of bytes consumed by the elements of the array.
         """
+        return self.get_refererenced_buffer_size()
+
+    def get_total_size_by_buffer(self):

Review comment:
       How is this now different from `get_total_buffer_size`?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] vibhatha commented on a change in pull request #11993: ARROW-15153: [Python] Expose ReferencedBufferSize to python

Posted by GitBox <gi...@apache.org>.

vibhatha commented on a change in pull request #11993:
URL: https://github.com/apache/arrow/pull/11993#discussion_r786540742



##########
File path: python/pyarrow/array.pxi
##########
@@ -988,13 +988,51 @@ cdef class Array(_PandasConvertible):
     def nbytes(self):
         """
         Total number of bytes consumed by the elements of the array.
+
+        In other words, the sum of bytes from all buffer 
+        ranges referenced.
+
+        Unlike `get_total_buffer_size` this method will account for array
+        offsets.
+
+        If buffers are shared between arrays then the shared
+        portion will be counted multiple times.
+
+        The dictionary of dictionary arrays will always be counted in their 
+        entirety even if the array only references a portion of the dictionary.
         """
-        size = 0
-        for buf in self.buffers():
-            if buf is not None:
-                size += buf.size
+        cdef:
+            shared_ptr[CArray] shd_ptr_c_array
+            CArray *c_array
+            CResult[int64_t] c_res_buffer
+
+        shd_ptr_c_array = pyarrow_unwrap_array(self)
+        c_array = shd_ptr_c_array.get()

Review comment:
       Yes, I also thought about that, but still used that API. :) 
   
   I will modify the corresponding calls in rest of the functions. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] ursabot edited a comment on pull request #11993: ARROW-15153: [Python] Expose ReferencedBufferSize to python

Posted by GitBox <gi...@apache.org>.

ursabot edited a comment on pull request #11993:
URL: https://github.com/apache/arrow/pull/11993#issuecomment-1018108378


   Benchmark runs are scheduled for baseline = cba3ee6017d1cf621a8cd805aa1963fae7af9ad9 and contender = 658bec37aa5cbdd53b5e4cdc81b8ba3962e67f11. 658bec37aa5cbdd53b5e4cdc81b8ba3962e67f11 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/e0556647972043ae8389d205a958e404...28e576650b9645fdb724548232d4357d/)
   [Failed :arrow_down:0.45% :arrow_up:0.0%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/a476e70b811f468f96a0db86ec338f0c...ac5f30256ae84501a37df1bdac42ba8b/)
   [Finished :arrow_down:0.26% :arrow_up:0.04%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/1e92b27e6081437ba54fcac95a79b50b...d4669296c34c4882b5fef1cb8a3c24d9/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] westonpace commented on a change in pull request #11993: ARROW-15153: [Python] Expose ReferencedBufferSize to python

Posted by GitBox <gi...@apache.org>.

westonpace commented on a change in pull request #11993:
URL: https://github.com/apache/arrow/pull/11993#discussion_r787462209



##########
File path: python/pyarrow/array.pxi
##########
@@ -988,13 +988,43 @@ cdef class Array(_PandasConvertible):
     def nbytes(self):
         """
         Total number of bytes consumed by the elements of the array.
+
+        In other words, the sum of bytes from all buffer 
+        ranges referenced.
+
+        Unlike `get_total_buffer_size` this method will account for array
+        offsets.
+
+        If buffers are shared between arrays then the shared
+        portion will be counted multiple times.
+
+        The dictionary of dictionary arrays will always be counted in their 
+        entirety even if the array only references a portion of the dictionary.
         """
-        size = 0
-        for buf in self.buffers():
-            if buf is not None:
-                size += buf.size
+        cdef:
+            CResult[int64_t] c_res_buffer

Review comment:
       Nit: The "buffer" part of `c_res_buffer` is a little confusing.  Maybe just `c_size_res` or something like that?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #11993: ARROW-15153: [Python] Expose ReferencedBufferSize to python [WIP]

Posted by GitBox <gi...@apache.org>.

jorisvandenbossche commented on a change in pull request #11993:
URL: https://github.com/apache/arrow/pull/11993#discussion_r774615391



##########
File path: python/pyarrow/array.pxi
##########
@@ -988,12 +988,64 @@ cdef class Array(_PandasConvertible):
         """
         Total number of bytes consumed by the elements of the array.
         """
+        return self.get_refererenced_buffer_size()
+
+    def get_total_size_by_buffer(self):
+        """
+        Total size of the buffer calculated by the sum of all buffers 
+        in the array
+        """
         size = 0
         for buf in self.buffers():
             if buf is not None:
                 size += buf.size
         return size
 
+    def get_refererenced_buffer_size(self):

Review comment:
       ```suggestion
       def get_referenced_buffer_size(self):
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] vibhatha commented on pull request #11993: ARROW-15153: [Python] Expose ReferencedBufferSize to python [WIP]

Posted by GitBox <gi...@apache.org>.

vibhatha commented on pull request #11993:
URL: https://github.com/apache/arrow/pull/11993#issuecomment-1000357457


   > I find the `get_referenced_buffer_size` name a bit confusing, but I assume this is just following the naming scheme in C++. The reason I find this a bit confusing is that "referenced" could also be interpreted as: "give me the size of all buffers I have a reference to, regardless of whether the full buffer is being used or not"
   
   I see your point, how about just saying `get_buffer_size`? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #11993: ARROW-15153: [Python] Expose ReferencedBufferSize to python

Posted by GitBox <gi...@apache.org>.

jorisvandenbossche commented on a change in pull request #11993:
URL: https://github.com/apache/arrow/pull/11993#discussion_r781082338



##########
File path: python/pyarrow/array.pxi
##########
@@ -986,14 +986,49 @@ cdef class Array(_PandasConvertible):
     @property
     def nbytes(self):
         """
-        Total number of bytes consumed by the elements of the array.
+        Returns the sum of bytes from all buffer ranges referenced
+
+        Unlike TotalBufferSize this method will account for array
+        offsets.
+
+        If buffers are shared between arrays then the shared
+        portion will be counted multiple times.
+
+        Dictionary arrays will always be counted in their entirety

Review comment:
       ```suggestion
           The dictionary of dictionary arrays will always be counted in their entirety
   ```

##########
File path: python/pyarrow/array.pxi
##########
@@ -986,14 +986,49 @@ cdef class Array(_PandasConvertible):
     @property
     def nbytes(self):
         """
-        Total number of bytes consumed by the elements of the array.
+        Returns the sum of bytes from all buffer ranges referenced

Review comment:
       I personally find the original text clearer. We can maybe leave it as is, and add the "buffer ranges referenced" in the next paragraph below?

##########
File path: python/pyarrow/array.pxi
##########
@@ -986,14 +986,49 @@ cdef class Array(_PandasConvertible):
     @property
     def nbytes(self):
         """
-        Total number of bytes consumed by the elements of the array.
+        Returns the sum of bytes from all buffer ranges referenced
+
+        Unlike TotalBufferSize this method will account for array

Review comment:
       ```suggestion
           Unlike `get_total_buffer_size` this method will account for array
   ```

##########
File path: python/pyarrow/array.pxi
##########
@@ -986,14 +986,49 @@ cdef class Array(_PandasConvertible):
     @property
     def nbytes(self):
         """
-        Total number of bytes consumed by the elements of the array.
+        Returns the sum of bytes from all buffer ranges referenced
+
+        Unlike TotalBufferSize this method will account for array
+        offsets.
+
+        If buffers are shared between arrays then the shared
+        portion will be counted multiple times.
+
+        Dictionary arrays will always be counted in their entirety
+        even if the array only references a portion of the dictionary.
         """
-        size = 0
-        for buf in self.buffers():
-            if buf is not None:
-                size += buf.size
+        cdef:
+            shared_ptr[CArray] shd_ptr_c_array
+            CArray *c_array
+            CResult[int64_t] c_res_buffer
+
+        shd_ptr_c_array = pyarrow_unwrap_array(self)
+        c_array = shd_ptr_c_array.get()
+        c_res_buffer = ReferencedBufferSize(deref(c_array))
+        size = GetResultValue(c_res_buffer)
         return size
 
+    def get_total_buffer_size(self):
+        """
+        The sum of bytes in each buffer referenced by the array

Review comment:
       ```suggestion
           The sum of bytes in each buffer referenced by the array.
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] westonpace commented on pull request #11993: ARROW-15153: [Python] Expose ReferencedBufferSize to python

Posted by GitBox <gi...@apache.org>.

westonpace commented on pull request #11993:
URL: https://github.com/apache/arrow/pull/11993#issuecomment-1018043957


   Thanks.  Unfortunately, it looks like this needs a rebase now.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] ursabot edited a comment on pull request #11993: ARROW-15153: [Python] Expose ReferencedBufferSize to python

Posted by GitBox <gi...@apache.org>.

ursabot edited a comment on pull request #11993:
URL: https://github.com/apache/arrow/pull/11993#issuecomment-1018108378


   Benchmark runs are scheduled for baseline = cba3ee6017d1cf621a8cd805aa1963fae7af9ad9 and contender = 658bec37aa5cbdd53b5e4cdc81b8ba3962e67f11. 658bec37aa5cbdd53b5e4cdc81b8ba3962e67f11 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/e0556647972043ae8389d205a958e404...28e576650b9645fdb724548232d4357d/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/a476e70b811f468f96a0db86ec338f0c...ac5f30256ae84501a37df1bdac42ba8b/)
   [Finished :arrow_down:0.26% :arrow_up:0.04%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/1e92b27e6081437ba54fcac95a79b50b...d4669296c34c4882b5fef1cb8a3c24d9/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] vibhatha commented on pull request #11993: ARROW-15153: [Python] Expose ReferencedBufferSize to python

Posted by GitBox <gi...@apache.org>.

vibhatha commented on pull request #11993:
URL: https://github.com/apache/arrow/pull/11993#issuecomment-1018053870


   > Thanks. Unfortunately, it looks like this needs a rebase now.
   
   Of course. I will do it shortly 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] github-actions[bot] commented on pull request #11993: ARROW-15153: [Python] Expose ReferencedBufferSize to python [WIP]

Posted by GitBox <gi...@apache.org>.

github-actions[bot] commented on pull request #11993:
URL: https://github.com/apache/arrow/pull/11993#issuecomment-997804044






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] vibhatha commented on pull request #11993: ARROW-15153: [Python] Expose ReferencedBufferSize to python

Posted by GitBox <gi...@apache.org>.

vibhatha commented on pull request #11993:
URL: https://github.com/apache/arrow/pull/11993#issuecomment-1015206272


   @jorisvandenbossche also do you see why the Python/AMD64 Conda Python 3.9 Sphinx & Numpydoc workflow is failing? 
   I also cannot build the docs locally due to the same error. But not quite clear why it is happening. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] vibhatha commented on a change in pull request #11993: ARROW-15153: [Python] Expose ReferencedBufferSize to python

Posted by GitBox <gi...@apache.org>.

vibhatha commented on a change in pull request #11993:
URL: https://github.com/apache/arrow/pull/11993#discussion_r788323147



##########
File path: python/pyarrow/array.pxi
##########
@@ -988,13 +988,43 @@ cdef class Array(_PandasConvertible):
     def nbytes(self):
         """
         Total number of bytes consumed by the elements of the array.
+
+        In other words, the sum of bytes from all buffer 
+        ranges referenced.
+
+        Unlike `get_total_buffer_size` this method will account for array
+        offsets.
+
+        If buffers are shared between arrays then the shared
+        portion will be counted multiple times.
+
+        The dictionary of dictionary arrays will always be counted in their 
+        entirety even if the array only references a portion of the dictionary.
         """
-        size = 0
-        for buf in self.buffers():
-            if buf is not None:
-                size += buf.size
+        cdef:
+            CResult[int64_t] c_res_buffer

Review comment:
       Sure, will update this. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] ursabot edited a comment on pull request #11993: ARROW-15153: [Python] Expose ReferencedBufferSize to python

Posted by GitBox <gi...@apache.org>.

ursabot edited a comment on pull request #11993:
URL: https://github.com/apache/arrow/pull/11993#issuecomment-1018108378


   Benchmark runs are scheduled for baseline = cba3ee6017d1cf621a8cd805aa1963fae7af9ad9 and contender = 658bec37aa5cbdd53b5e4cdc81b8ba3962e67f11. 658bec37aa5cbdd53b5e4cdc81b8ba3962e67f11 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/e0556647972043ae8389d205a958e404...28e576650b9645fdb724548232d4357d/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/a476e70b811f468f96a0db86ec338f0c...ac5f30256ae84501a37df1bdac42ba8b/)
   [Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/1e92b27e6081437ba54fcac95a79b50b...d4669296c34c4882b5fef1cb8a3c24d9/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] vibhatha commented on a change in pull request #11993: ARROW-15153: [Python] Expose ReferencedBufferSize to python

Posted by GitBox <gi...@apache.org>.

vibhatha commented on a change in pull request #11993:
URL: https://github.com/apache/arrow/pull/11993#discussion_r776135785



##########
File path: python/pyarrow/array.pxi
##########
@@ -988,12 +988,64 @@ cdef class Array(_PandasConvertible):
         """
         Total number of bytes consumed by the elements of the array.
         """
+        return self.get_refererenced_buffer_size()
+
+    def get_total_size_by_buffer(self):

Review comment:
       I have added both `TotalBufferSize` and `ReferencedBufferSize` with test cases. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] ursabot commented on pull request #11993: ARROW-15153: [Python] Expose ReferencedBufferSize to python

Posted by GitBox <gi...@apache.org>.

ursabot commented on pull request #11993:
URL: https://github.com/apache/arrow/pull/11993#issuecomment-1018108378


   Benchmark runs are scheduled for baseline = cba3ee6017d1cf621a8cd805aa1963fae7af9ad9 and contender = 658bec37aa5cbdd53b5e4cdc81b8ba3962e67f11. 658bec37aa5cbdd53b5e4cdc81b8ba3962e67f11 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Scheduled] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/e0556647972043ae8389d205a958e404...28e576650b9645fdb724548232d4357d/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/a476e70b811f468f96a0db86ec338f0c...ac5f30256ae84501a37df1bdac42ba8b/)
   [Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/1e92b27e6081437ba54fcac95a79b50b...d4669296c34c4882b5fef1cb8a3c24d9/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] vibhatha commented on pull request #11993: ARROW-15153: [Python] Expose ReferencedBufferSize to python

Posted by GitBox <gi...@apache.org>.

vibhatha commented on pull request #11993:
URL: https://github.com/apache/arrow/pull/11993#issuecomment-1016053229


   @jorisvandenbossche updated the PR :) 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] vibhatha commented on a change in pull request #11993: ARROW-15153: [Python] Expose ReferencedBufferSize to python [WIP]

Posted by GitBox <gi...@apache.org>.

vibhatha commented on a change in pull request #11993:
URL: https://github.com/apache/arrow/pull/11993#discussion_r772744416



##########
File path: python/pyarrow/array.pxi
##########
@@ -994,6 +994,27 @@ cdef class Array(_PandasConvertible):
                 size += buf.size
         return size
 
+    def get_refererenced_buffer_size(self):
+        cdef:
+            shared_ptr[CArray] shd_ptr_c_array
+            CArray *c_array
+            CResult[int64_t] c_res_buffer
+        
+        shd_ptr_c_array = pyarrow_unwrap_array(self)
+        c_array = shd_ptr_c_array.get()
+        c_res_buffer = ReferencedBufferSize(c_array[0])
+        size = GetResultValue(c_res_buffer)
+        return size
+
+    @property
+    def get_buffer_size(self):

Review comment:
       I see. I will update the PR 👍




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jorisvandenbossche commented on pull request #11993: ARROW-15153: [Python] Expose ReferencedBufferSize to python [WIP]

Posted by GitBox <gi...@apache.org>.

jorisvandenbossche commented on pull request #11993:
URL: https://github.com/apache/arrow/pull/11993#issuecomment-1000351014


   I find the `get_referenced_buffer_size` name a bit confusing, but I assume this is just following the naming scheme in C++. The reason I find this a bit confusing is that "referenced" could also be interpreted as: "give me the size of all buffers I have a reference to, regardless of whether the full buffer is being used or not"


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] vibhatha commented on pull request #11993: ARROW-15153: [Python] Expose ReferencedBufferSize to python

Posted by GitBox <gi...@apache.org>.

vibhatha commented on pull request #11993:
URL: https://github.com/apache/arrow/pull/11993#issuecomment-1005563766


   @westonpace thank you for the review. I will update the PR. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org