You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/06/15 12:02:13 UTC

[GitHub] [arrow] zagto commented on a diff in pull request #13333: ARROW-16773: [Docs][Format] Document Run-Length encoding in Arrow columnar format

zagto commented on code in PR #13333:
URL: https://github.com/apache/arrow/pull/13333#discussion_r897892969


##########
docs/source/format/Columnar.rst:
##########
@@ -765,6 +765,85 @@ application.
 We discuss dictionary encoding as it relates to serialization further
 below.
 
+.. _run-length-encoded-layout:
+
+Run-Length-encoded Layout
+-------------------------
+
+Run-Length is a data representation that represents data as sequences of the
+same a, called runs. Each run is represented as a value, and an integer
+describing how often this value is repeated.
+
+Any array can be run-length-encoded. A run-length encoded array has a single
+buffer holding as many 32-bit integers, as there are runs. The actual values are
+hold in a child array, which is just a regular array
+
+The dictionary is stored as an optional
+property of an array. When a field is dictionary encoded, the values are
+represented by an array of non-negative integers representing the index of the
+value in the dictionary. The memory layout for a dictionary-encoded array is
+the same as that of a primitive integer layout. The dictionary is handled as a
+separate columnar array with its own respective layout.
+
+As an example, you could have the following data: ::
+
+    type: Float32
+
+    [1.0, 1.0, 1.0, 1.0, null, null, 2.0]
+
+In Run-length-encoded form, this could appear as:
+
+::
+
+    * Length: 3, Null count: 2
+    * Accumulated run lengths buffer:
+
+      | Bytes 0-3   | Bytes 4-7   | Bytes 8-11  | Bytes  6-63           |
+      |-------------|-------------|-------------|-----------------------|
+      | 4           | 6           | 7           | unspecified (padding) |

Review Comment:
   > How are top-level nulls interpreted? Or are they disallowed?
   
   The idea for now is that the top level only has a single buffer, holding the run length values. So it is not possible to a have a null three
   
   > Shouldn't these be 4, 2, 1?
   
   No they hold the accumulated length of all runs from the first to the current one. The paragraph above should have said that, sorry.
   
   The reason for storing accumulated lengths is to make random access to a logical index possible using binary search. We can still get the length of the current run efficiently by subtracting the two adjacent values
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org