You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/08/15 06:51:20 UTC

[GitHub] [arrow] liusitan opened a new issue, #13875: How does arrow parse its ipc format?

liusitan opened a new issue, #13875:
URL: https://github.com/apache/arrow/issues/13875

   Hi there, I am trying to hack the arrow IPC format, I am confused about how does arrow differentiate between different type in record batch buffers and parse it.
   <img width="671" alt="image" src="https://user-images.githubusercontent.com/20542539/184588362-d528770b-79e3-4307-be75-9dfbc01e60a4.png">
   for example, now I store a data frame
   ```
   name | age | balance
   'jack'    | 12.  |  100.23
   'Jennie' | 24   | 2000.34
   ```
   I believe 'jack' and 'jennie' is stored in one buffer
   and 12 and 24 are stored in another buffer 
   100.23 and 2000.34 is stored in another buffer
   
   how does the arrow deserializer know which tools to parse the buffer? 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] liusitan commented on issue #13875: How does arrow parse its ipc format?

Posted by GitBox <gi...@apache.org>.

liusitan commented on issue #13875:
URL: https://github.com/apache/arrow/issues/13875#issuecomment-1215378788

   hi @zeroshade  thanks for sharing and replying, I am confused about the field node that you wrote down because according to the fbs file on the main branch, it did not define the type, are you referring to something else?
   
   > 
   FieldNode 0: Utf8 name='name', length=2, nulls=0
   FieldNode 1: Int32 name='age', length=2, nulls=0
   FieldNode 2: Float64 name='balance', length=2, nulls=0
   
   <img width="1122" alt="image" src="https://user-images.githubusercontent.com/20542539/184680682-6289b11f-a418-40ef-b26c-38c5f75f5b9c.png">
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] zeroshade commented on issue #13875: How does arrow parse its ipc format?

Posted by GitBox <gi...@apache.org>.

zeroshade commented on issue #13875:
URL: https://github.com/apache/arrow/issues/13875#issuecomment-1215495286

   > I can use the type underlined above to determine the type of the field node, right?
   
   Correct!
   
   > I notice it comments that _ This is the type of the decoded value if the field is dictionary encoded_. what does dictionary encoded mean? is the example below dictionary encoded...
   
   Dictionary encoding is a specific type of Arrow array that is ideal for large arrays with low cardinality. You can find a full description of dictionary-encoded arrays [here](https://arrow.apache.org/docs/format/Columnar.html#dictionary-encoded-layout) in the Arrow docs.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] liusitan commented on issue #13875: How does arrow parse its ipc format?

Posted by GitBox <gi...@apache.org>.

liusitan commented on issue #13875:
URL: https://github.com/apache/arrow/issues/13875#issuecomment-1215518168

   got it thank you!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] liusitan commented on issue #13875: How does arrow parse its ipc format?

Posted by GitBox <gi...@apache.org>.

liusitan commented on issue #13875:
URL: https://github.com/apache/arrow/issues/13875#issuecomment-1215419703

   Thanks for the lightning response first! got it, I am looking at the schema.fbs now
   <img width="1207" alt="image" src="https://user-images.githubusercontent.com/20542539/184681707-97e8fffc-3147-4b4e-84fa-bde7bc01a5c2.png">
   I guess you are saying that the field in the schema type includes the type info, and according to the Field fbs definition, Field contains 
   <img width="1150" alt="image" src="https://user-images.githubusercontent.com/20542539/184682176-ec420049-af15-4011-bcd3-031364f42cf8.png">
   I can use the type underlined above to determine the type of the field node. 
   I notice it comments that   _ This is the type of the decoded value if the field is **dictionary encoded**_. what does dictionary encoded mean?  is the example below dictionary encoded 
   name | age | balance
   'jack'    | 12.  |  100.23
   'Jennie' | 24   | 2000.34
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] liusitan closed issue #13875: How does arrow parse its ipc format?

Posted by GitBox <gi...@apache.org>.

liusitan closed issue #13875: How does arrow parse its ipc format?
URL: https://github.com/apache/arrow/issues/13875


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] zeroshade commented on issue #13875: How does arrow parse its ipc format?

Posted by GitBox <gi...@apache.org>.

zeroshade commented on issue #13875:
URL: https://github.com/apache/arrow/issues/13875#issuecomment-1215324205

   Hey @liusitan, the first message in an Arrow IPC stream should always be a `Schema` message which describes the actual schema of the following record batch messages. So anything processing an IPC `RecordBatch` message should already have the expected schema for that record batch. 
   
   Then you have the `nodes` field of the flatbuffer message which is a flattened version of the logical schema. So for example, if you had the following schema:
   
   ```
   col1: Struct<a: Int32, b: List<item: Int64>, c: Float64>
   col2: Utf8
   ```
   
   That `Nodes` list would be:
   
   ```
   FieldNode 0: Struct name='col1'
   FieldNode 1: Int32 name='a'
   FieldNode 2: List name='b'
   FieldNode 3: Int64 name='item'
   FieldNode 4: Float64 name='c'
   FieldNode 5: Utf8 name='col2'
   ```
   
   Each of those `FieldNode` objects contains the appropriate metadata describing the length of the array and the number of nulls. The recordbatch message also has that `buffers` list which is a list of the offset and length of the buffers that are sent in the message body. Thus, for the above described example we'd expect the buffers described to correspond to:
   
   ```
   buffer 0: field 0 validity bitmap
   buffer 1: field 1 validity bitmap
   buffer 2: field 1 values buffer (ie: a buffer of int32 values)
   buffer 3: field 2 validity bitmap
   buffer 4: field 2 offsets (ie: the offsets buffer for the List array, a buffer of int32 values)
   buffer 5: field 3 validity bitmap
   buffer 6: field 3 values (ie: a buffer of int64 values)
   buffer 7: field 4 validity bitmap
   buffer 8: field 4 values (ie: a buffer of float64 values)
   buffer 9: field 5 validity bitmap
   buffer 10: field 5 offsets (ie: the offsets for col2, the utf8 array, a buffer of int32 values)
   buffer 11: field 5 data (ie: the data buffer for the utf8 column, all of the string values)
   ```
   
   For your provided example:
   
   ```
   name | age | balance
   'jack'    | 12.  |  100.23
   'Jennie' | 24   | 2000.34
   ```
   
   The first message in the stream would be the schema:
   
   ```
   name: utf8
   age: Int32
   balance: Float64
   ```
   
   the `nodes` in the `RecordBatch` message (the second message of the stream) would contain:
   
   ```
   FieldNode 0: Utf8 name='name', length=2, nulls=0
   FieldNode 1: Int32 name='age', length=2, nulls=0
   FieldNode 2: Float64 name='balance', length=2, nulls=0
   ```
   
   So now you can parse the buffers because you know the types:
   
   ```
   buffer 0: validity bitmap for name column (should be 1 byte)
   buffer 1: offsets for name column (should be the int32 values [0, 4, 10])
   buffer 2: data buffer for name column (should be 'jackJennie')
   buffer 3: validity bitmap for age column (should be 1 byte)
   buffer 4: values for age column (should be int32 values [12, 24], ie: 8 bytes)
   buffer 5: validity bitmap for balance column
   buffer 6: values for balance column (should be the bytes for [100.23, 2000.34], ie: 16 bytes)
   ```
   
   Hope that helps


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] zeroshade commented on issue #13875: How does arrow parse its ipc format?

Posted by GitBox <gi...@apache.org>.

zeroshade commented on issue #13875:
URL: https://github.com/apache/arrow/issues/13875#issuecomment-1215388732

   @liusitan the `FieldNode`'s type is coming from the Schema that we already know from the first message (I included the type in my example to help illustrate which node they corresponded to)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] liusitan commented on issue #13875: How does arrow parse its ipc format?

Posted by GitBox <gi...@apache.org>.

liusitan commented on issue #13875:
URL: https://github.com/apache/arrow/issues/13875#issuecomment-1214694078

the reason why I am hacking the Arrow ipc format is that recently I am implementing a fuse filesystem for vineyard, which is an immutable storage manager that utilize the columnar format as the arrow.

We decided to enable our clients to access vineyard objects by reading from the Arrow ipc format. which means when the client wants to access the objects stored in the vineyard, the fuse file system will searlize the corresponding vienayrd objects on the fly, store it in the fuse process, and provide the serialized Arrow-formatted vineyard objects to clients. However, this approach may lead to heavy memory usage.

We are thinking, is it possible to create a mapping between the Arrow ipc format to the vineyard objects, in terms of the information stored in the vineyard objects' metadata, it's totally possible, especially in terms of the dataframe, we store that in units of column as well.

Theoretically, if a user wants to access the 100 byte to 200 btyes of the Arrow-formatted vineyard objects, conceptually, that's a range of data in the first column, my implementation can realize its conceptual representation from the byte range, and grab the data from vineyard, serialized data, provide what client wants.

Practically, I haven't found a way to precompute the sizes of each part of the serialized Arrow ipc format for now, given the documentation so far. After doing some question compression, I raised the question above.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org