Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/08/05 09:09:00 UTC

[GitHub] [arrow] arthursunbao opened a new issue #10885: Does arrow intends to support IDL schema like protobuf?

arthursunbao opened a new issue #10885:
URL: https://github.com/apache/arrow/issues/10885


   Hi All, 
   
   I am trying to use Arrow across multiple language implementations (Java, C++), but I found that there is currently no IDL schema, like Protobuf's, that both languages can import so that code in either language can parse the same data from disk or in memory based on one shared IDL schema. Thanks
   
   Jason
   


[GitHub] [arrow] arthursunbao commented on issue #10885: Does arrow intends to support IDL schema like protobuf?

Posted by GitBox <gi...@apache.org>.
arthursunbao commented on issue #10885:
URL: https://github.com/apache/arrow/issues/10885#issuecomment-894037170


   Hi  westonpace,
   
   Thanks for your quick response. 
   
   Our scenario is like this: 
   
   We have a recommendation system and we want to transfer user data from Kafka and Hive to an online Redis-like storage. We found that Arrow has good columnar storage capabilities and can deserialize data without parsing the entire payload the way Protobuf does, so we use Arrow (ArrowStreamWriter) to serialize and compress the user data into binary in Kafka and Hive.
   
   However, in our scenario the data schema is different for every user, so we want to keep an IDL schema for each user in an independent management system. That way, once the serialized data is in Redis, the third-party system that loads the Arrow-serialized data from Redis can look up the user's unique schema and deserialize the binary data using ArrowFileReader.
   
   We dug into the Arrow Java API and found that, to write data that ArrowFileReader can later read, we first need to do something like this:
   
   ```java
   RootAllocator allocator = new RootAllocator();
   VectorSchemaRoot schemaRoot = VectorSchemaRoot.create(UserSchema.schema(), allocator);
   FileOutputStream fileOutputStream = new FileOutputStream(FILE_PATH);
   ArrowFileWriter arrowFileWriter = new ArrowFileWriter(schemaRoot, null, fileOutputStream.getChannel());
   ```
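   
   And on the read side, a rough, untested sketch of the corresponding path with ArrowFileReader (same FILE_PATH as above):
   
   ```java
   try (RootAllocator readAllocator = new RootAllocator();
        FileInputStream fileInputStream = new FileInputStream(FILE_PATH);
        ArrowFileReader reader = new ArrowFileReader(fileInputStream.getChannel(), readAllocator)) {
       VectorSchemaRoot root = reader.getVectorSchemaRoot();  // schema is read from the file itself
       while (reader.loadNextBatch()) {
           // root now holds the vectors of the current record batch
       }
   }
   ```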
   
   So basically, if a user uses the Java SDK, they need to keep a UserSchema Java class that provides UserSchema.schema().
   So what if a user wants to use the C++ SDK to read the schema? Does that mean they need to keep an equivalent C++ struct as well?
   
   Thanks in advance
   Jason
   
   


[GitHub] [arrow] westonpace commented on issue #10885: Does arrow intends to support IDL schema like protobuf?

Posted by GitBox <gi...@apache.org>.
westonpace commented on issue #10885:
URL: https://github.com/apache/arrow/issues/10885#issuecomment-893706079


   Can you expand a little bit on what problem you are trying to solve?  There is a schema.  It can be serialized to parquet and to the Arrow IPC format.  It specifies the name and data type for an ordered list of columns.
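   
   For illustration, a minimal sketch of such a schema built programmatically with the Arrow Java API (the field names and types here are only examples):
   
   ```java
   import java.util.Arrays;
   
   import org.apache.arrow.vector.types.pojo.ArrowType;
   import org.apache.arrow.vector.types.pojo.Field;
   import org.apache.arrow.vector.types.pojo.Schema;
   
   // An Arrow schema is an ordered list of (name, data type) fields.
   Schema schema = new Schema(Arrays.asList(
       Field.nullable("name", new ArrowType.Utf8()),
       Field.nullable("gender", new ArrowType.Utf8()),
       Field.nullable("birth", new ArrowType.Int(64, /* signed */ true))));
   ```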


[GitHub] [arrow] arthursunbao closed issue #10885: Does arrow intends to support IDL schema like protobuf?

Posted by GitBox <gi...@apache.org>.
arthursunbao closed issue #10885:
URL: https://github.com/apache/arrow/issues/10885


   


[GitHub] [arrow] westonpace commented on issue #10885: Does arrow intends to support IDL schema like protobuf?

Posted by GitBox <gi...@apache.org>.
westonpace commented on issue #10885:
URL: https://github.com/apache/arrow/issues/10885#issuecomment-894450914


   The Java type `org.apache.arrow.vector.types.pojo.Schema` should be the same concept (not necessarily the same memory) as the C++ type `arrow::Schema` or in Python `pyarrow.Schema`.
   
   In the Arrow columnar format, the schema is defined (as a Flatbuffers message) here: https://github.com/apache/arrow/blob/master/format/Schema.fbs
   
   Parquet also has a schema concept which is generally compatible with Arrow's schema.
   
   For sharing the schema between languages or processes there are a few serialization choices:
   
   * You can save an empty table in the IPC format (sometimes called feather)
     * This is probably the best choice for file storage of a schema
   * You can save an empty table in the parquet format
   * You can use the [C data interface](https://arrow.apache.org/docs/format/CDataInterface.html#the-arrowschema-structure) which defines a common memory representation of a schema (among other things)
     * This is probably the best choice when you don't want to use a file
   
   Java currently supports the IPC format so you should be able to read and write IPC files with empty tables.  That should allow you to save, restore, and transfer a schema.  You can also save the schema in Java and load it in C++.  You could either use a temporary file or a shared buffer.
   
   There is work in progress to add the C data interface to Java.  This will allow you to copy schemas back and forth between Java and C++ without going to an intermediate file / byte array.
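   
   A rough, untested sketch of that idea in Java, here writing the zero-row table with the IPC stream format into a byte array rather than a file (assuming an org.apache.arrow.vector.types.pojo.Schema instance `schema` is already available):
   
   ```java
   import java.io.ByteArrayInputStream;
   import java.io.ByteArrayOutputStream;
   
   import org.apache.arrow.memory.RootAllocator;
   import org.apache.arrow.vector.VectorSchemaRoot;
   import org.apache.arrow.vector.ipc.ArrowStreamReader;
   import org.apache.arrow.vector.ipc.ArrowStreamWriter;
   import org.apache.arrow.vector.types.pojo.Schema;
   
   // Write an "empty table": a zero-row VectorSchemaRoot carries only the schema.
   byte[] schemaBytes;
   try (RootAllocator allocator = new RootAllocator();
        VectorSchemaRoot emptyRoot = VectorSchemaRoot.create(schema, allocator);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        ArrowStreamWriter writer = new ArrowStreamWriter(emptyRoot, null, out)) {
       writer.start();  // writes the schema message
       writer.end();    // no record batches are written
       schemaBytes = out.toByteArray();
   }
   
   // Any Arrow implementation (Java, C++, Python, ...) can recover the schema from these bytes.
   try (RootAllocator allocator = new RootAllocator();
        ArrowStreamReader reader = new ArrowStreamReader(new ByteArrayInputStream(schemaBytes), allocator)) {
       Schema restored = reader.getVectorSchemaRoot().getSchema();  // structurally equal to the original schema
   }
   ```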


[GitHub] [arrow] westonpace commented on issue #10885: Does arrow intends to support IDL schema like protobuf?

Posted by GitBox <gi...@apache.org>.
westonpace commented on issue #10885:
URL: https://github.com/apache/arrow/issues/10885#issuecomment-895485912


   > Thanks, so you mean the IPC (feather) file format is the output of ArvoStreamWriter, which is a binary file, just with no data in it, right?
   
   I'm sorry but I don't know how AvroStreamWriter works.
   
   > And the file can be shared between the C++ and Java SDKs?
   
   Yes, an IPC file can be read and written by both C++ and Java.
   
   > But I could not find a method to serialize a single vector together with its VectorSchemaRoot independently to a file; right now I can only create a new VectorSchemaRoot and put the vector inside, which is a bit troublesome. Is there a more convenient way to do this?
   
   I, personally, am not very familiar with the Java API.  You may get a quicker response to this question by asking on the user mailing list: https://arrow.apache.org/community/
   


[GitHub] [arrow] arthursunbao commented on issue #10885: Does arrow intends to support IDL schema like protobuf?

Posted by GitBox <gi...@apache.org>.
arthursunbao commented on issue #10885:
URL: https://github.com/apache/arrow/issues/10885#issuecomment-895778664


   OK, thanks. That's all I wanted to ask.


[GitHub] [arrow] arthursunbao commented on issue #10885: Does arrow intends to support IDL schema like protobuf?

Posted by GitBox <gi...@apache.org>.
arthursunbao commented on issue #10885:
URL: https://github.com/apache/arrow/issues/10885#issuecomment-894989430


   Thanks, so you mean the IPC (feather) file format is the output of ArvoStreamWriter, which is a binary file, just with no data in it, right? And the file can be shared between the C++ and Java SDKs?
   
   By the way, I have another question I would like to ask.
   
   So basically I have an IPC-format schema with four fields: name, gender, birth, and extra_info, and I want to export the four field vectors into a Redis HASH structure with KV pairs like:
   
   Redis key:          "primarykey"
   Redis HASH entry 1: key: name,   value: Arrow-serialized name data
   Redis HASH entry 2: key: gender, value: Arrow-serialized gender data
   ...
   
   But I could not find a method to serialize a single vector together with its VectorSchemaRoot independently to a file; right now I can only create a new VectorSchemaRoot and put the vector inside (roughly as sketched below), which is a bit troublesome. Is there a more convenient way to do this?
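   
   Roughly, the workaround looks like this (rough, untested sketch; the class name is just for illustration), producing one byte[] per column for the Redis HASH values:
   
   ```java
   import java.io.ByteArrayOutputStream;
   import java.util.Collections;
   
   import org.apache.arrow.vector.FieldVector;
   import org.apache.arrow.vector.VectorSchemaRoot;
   import org.apache.arrow.vector.ipc.ArrowStreamWriter;
   
   public final class VectorToBytes {
       /** Serialize one column as its own single-field Arrow IPC stream. */
       static byte[] serializeOneVector(FieldVector vector) throws Exception {
           // The one-field root only borrows the vector; closing the root would also
           // release the vector's buffers, so it is deliberately not closed here.
           VectorSchemaRoot singleFieldRoot = new VectorSchemaRoot(
               Collections.singletonList(vector.getField()),
               Collections.singletonList(vector),
               vector.getValueCount());
           try (ByteArrayOutputStream out = new ByteArrayOutputStream();
                ArrowStreamWriter writer = new ArrowStreamWriter(singleFieldRoot, null, out)) {
               writer.start();
               writer.writeBatch();  // one record batch holding just this column's data
               writer.end();
               return out.toByteArray();
           }
       }
   }
   ```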
   
   Thanks Jason


[GitHub] [arrow] nealrichardson closed issue #10885: Does arrow intends to support IDL schema like protobuf?

Posted by GitBox <gi...@apache.org>.
nealrichardson closed issue #10885:
URL: https://github.com/apache/arrow/issues/10885


   

