Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/01/12 16:13:55 UTC

[GitHub] [arrow] ilya-biryukov opened a new issue #9178: Support for the binary SQL type in rust/datafusion

ilya-biryukov opened a new issue #9178:
URL: https://github.com/apache/arrow/issues/9178


   Hi folks,
   
   I'm new here, so let me start by apologizing if I'm asking this question in the wrong place.
   I'm more than happy to ask again in the proper channel; just let me know which one.
   
   I'm trying to implement a few user-defined functions (UDFs) in rust/datafusion (in particular, [HyperLogLog](https://en.wikipedia.org/wiki/HyperLogLog)).
   For my use case it's essential to be able to store and later retrieve the partially computed state.
   My UDF will therefore have to accept and return a binary blob in some form.
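   
   To make this concrete, here is a minimal, self-contained sketch of the partial state I have in mind (the struct and register count are illustrative stand-ins, not real HyperLogLog math):
   
   ```rust
   // Illustrative only: a fixed-size register array standing in for real
   // HyperLogLog state. The point is that the aggregate's intermediate
   // state is an opaque byte blob.
   const REGISTERS: usize = 256;
   
   struct HllState {
       registers: [u8; REGISTERS],
   }
   
   impl HllState {
       // Serialize the partial state: this is what the aggregate UDF
       // would need to hand back to the engine as a SQL BINARY value.
       fn to_bytes(&self) -> Vec<u8> {
           self.registers.to_vec()
       }
   
       // Restore the partial state from a blob handed back by the engine.
       fn from_bytes(bytes: &[u8]) -> Self {
           let mut registers = [0u8; REGISTERS];
           registers.copy_from_slice(&bytes[..REGISTERS]);
           Self { registers }
       }
   }
   
   fn main() {
       let state = HllState { registers: [0; REGISTERS] };
       let blob = state.to_bytes();
       let restored = HllState::from_bytes(&blob);
       assert_eq!(&state.registers[..], &restored.registers[..]);
   }
   ```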
   
   However, the binary SQL type is listed as unsupported in the `README.md`, and I'm stuck.
   Are there any plans to support it?
   
   If not, is there any guidance on how best to implement it, or any known obstacles?
   





[GitHub] [arrow] alamb commented on issue #9178: Support for the binary SQL type in rust/datafusion

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #9178:
URL: https://github.com/apache/arrow/issues/9178#issuecomment-760168643


   Hi @ilya-biryukov -- the arrow dev mailing list is probably the best place to ask such a question. You can subscribe, or browse the archives, as described under the "Mailing Lists" heading of https://arrow.apache.org/community/
   
   I don't know of any imminent plans to support the binary SQL type in DataFusion, but the underlying Arrow libraries have the requisite support, I think. We just need to plumb that support through the frontend and query layers. @jorgecarleitao, @Dandandan, or @seddonm1 -- do you know of any plans to add support for BINARY / VARBINARY in DataFusion?
   
   
   
   Thanks for the ping @ovr. Things have been busy for me at work the last few days, so I am somewhat behind on Arrow.





[GitHub] [arrow] ovr commented on issue #9178: Support for the binary SQL type in rust/datafusion

Posted by GitBox <gi...@apache.org>.
ovr commented on issue #9178:
URL: https://github.com/apache/arrow/issues/9178#issuecomment-759777361


   cc @andygrove @alamb





[GitHub] [arrow] alamb edited a comment on issue #9178: Support for the binary SQL type in rust/datafusion

Posted by GitBox <gi...@apache.org>.
alamb edited a comment on issue #9178:
URL: https://github.com/apache/arrow/issues/9178#issuecomment-760168643


   Hi @ilya-biryukov -- the arrow dev mailing list is probably the best place to ask such a question. You can subscribe, or browse the archives, as described under the "Mailing Lists" heading of https://arrow.apache.org/community/
   
   I don't know of any imminent plans to support the binary SQL type in DataFusion, but the underlying Arrow libraries have the requisite support, I think (e.g. the `Binary` and `LargeBinary` types, [source link](https://github.com/apache/arrow/blob/master/rust/arrow/src/datatypes.rs#L118-L123)). We just need to plumb that support through the frontend and query layers. @jorgecarleitao, @Dandandan, or @seddonm1 -- do you know of any plans to add support for BINARY / VARBINARY in DataFusion?
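   
   For example, building a binary array directly already works at the array level; a minimal sketch (exact API details may vary by arrow version):
   
   ```rust
   use arrow::array::{Array, BinaryArray};
   
   fn main() {
       let a: &[u8] = b"abc";
       let b: &[u8] = b"\x00\x01";
       // Build a nullable binary column straight from optional byte slices;
       // this is the array-level support mentioned above.
       let array = BinaryArray::from(vec![Some(a), None, Some(b)]);
   
       assert_eq!(array.len(), 3);
       assert!(array.is_null(1));
       assert_eq!(array.value(0), a);
   }
   ```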
   
   
   
   Thanks for the ping @ovr. Things have been busy for me at work the last few days, so I am somewhat behind on Arrow.





[GitHub] [arrow] ilya-biryukov commented on issue #9178: Support for the binary SQL type in rust/datafusion

Posted by GitBox <gi...@apache.org>.
ilya-biryukov commented on issue #9178:
URL: https://github.com/apache/arrow/issues/9178#issuecomment-766711505


   Hi @jorgecarleitao,
   
   I forgot to mention that my use case involves an aggregate UDF. The `Accumulator` trait requires accepting and returning instances of `ScalarValue`. It was not hard to add minimal support in `datafusion` (see https://github.com/cube-js/arrow/commit/5813558cca8f70b7af901709ed367cf3d96d7f49 and https://github.com/cube-js/arrow/commit/81eabb45654f7793920b698bd1eafe031afda93e in our fork). I have not filed a PR yet due to time constraints.
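   
   To illustrate the shape of what the fork enables, here is a minimal, self-contained sketch using stand-in types (the real `Accumulator` trait returns `Result`s and has more methods, and the variant is assumed to look roughly like `ScalarValue::Binary(Option<Vec<u8>>)`):
   
   ```rust
   // Stand-in for the variant added in our fork; upstream `ScalarValue`
   // has no binary variant yet.
   #[derive(Debug, Clone, PartialEq)]
   enum ScalarValue {
       Binary(Option<Vec<u8>>),
   }
   
   // Heavily simplified stand-in for datafusion's `Accumulator` trait:
   // `state()` is what gets shipped between partial and final aggregation,
   // which is why a binary `ScalarValue` is needed at all.
   trait Accumulator {
       fn state(&self) -> Vec<ScalarValue>;
       fn merge(&mut self, states: &[ScalarValue]);
       fn evaluate(&self) -> ScalarValue;
   }
   
   struct HllAccumulator {
       // Opaque serialized HyperLogLog registers.
       bytes: Vec<u8>,
   }
   
   impl Accumulator for HllAccumulator {
       fn state(&self) -> Vec<ScalarValue> {
           vec![ScalarValue::Binary(Some(self.bytes.clone()))]
       }
   
       fn merge(&mut self, states: &[ScalarValue]) {
           for s in states {
               if let ScalarValue::Binary(Some(other)) = s {
                   // Real code would merge HLL registers here; this only
                   // shows the partial state round-tripping as bytes.
                   self.bytes = other.clone();
               }
           }
       }
   
       fn evaluate(&self) -> ScalarValue {
           ScalarValue::Binary(Some(self.bytes.clone()))
       }
   }
   
   fn main() {
       let mut acc = HllAccumulator { bytes: vec![1, 2, 3] };
       let partial = acc.state();
       acc.merge(&partial);
       assert_eq!(acc.evaluate(), ScalarValue::Binary(Some(vec![1, 2, 3])));
   }
   ```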








[GitHub] [arrow] wesm closed issue #9178: Support for the binary SQL type in rust/datafusion

Posted by GitBox <gi...@apache.org>.
wesm closed issue #9178:
URL: https://github.com/apache/arrow/issues/9178


   





[GitHub] [arrow] jorgecarleitao commented on issue #9178: Support for the binary SQL type in rust/datafusion

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on issue #9178:
URL: https://github.com/apache/arrow/issues/9178#issuecomment-765142713


   Hi @ilya-biryukov, have you tried it?
   
   The following works for me (based on the `simple_udf` example):
   
   ```rust
   use arrow::{
       array::{ArrayRef, BinaryArray, Int64Array},
       datatypes::DataType,
       record_batch::RecordBatch,
       util::pretty,
   };
   
   use datafusion::error::Result;
   use datafusion::{physical_plan::functions::ScalarFunctionImplementation, prelude::*};
   use std::sync::Arc;
   
   // create local execution context with an in-memory table
   fn create_context() -> Result<ExecutionContext> {
       use arrow::datatypes::{Field, Schema};
       use datafusion::datasource::MemTable;
       // define a schema.
       let schema = Arc::new(Schema::new(vec![
           Field::new("c", DataType::Binary, false),
       ]));
   
       let a: &[u8] = b"aaaa";
       // define data.
       let batch = RecordBatch::try_new(
           schema.clone(),
           vec![
               Arc::new(BinaryArray::from(vec![Some(a), None, None, None])),
           ],
       )?;
   
       // declare a new context. In the Spark API, this corresponds to a new Spark SQL session
       let mut ctx = ExecutionContext::new();
   
       // declare a table in memory. In the Spark API, this corresponds to createDataFrame(...).
       let provider = MemTable::try_new(schema, vec![vec![batch]])?;
       ctx.register_table("t", Box::new(provider));
       Ok(ctx)
   }
   
   /// In this example we declare a single-argument UDF that computes the byte length of each value in a Binary column, returning Int64
   #[tokio::main]
   async fn main() -> Result<()> {
       let mut ctx = create_context()?;
   
       // First, declare the actual implementation of the calculation
       let len: ScalarFunctionImplementation = Arc::new(|args: &[ArrayRef]| {
           // in DataFusion, all `args` and output are dynamically-typed arrays, which means that we need to:
           // 1. cast the values to the type we want
           // 2. perform the computation for every element in the array (using a loop or SIMD)
   
           // this is guaranteed by DataFusion based on the function's signature.
           assert_eq!(args.len(), 1);
   
           let value = args[0]
               .as_any()
               .downcast_ref::<BinaryArray>()
               .expect("cast failed");
   
           // 2. run the UDF
           let array: Int64Array = value.iter().map(|base| {
               // in arrow, any value can be null.
               base.map(|x| x.len() as i64)
           }).collect();
           Ok(Arc::new(array))
       });
   
       // Next:
       // * give it a name so that it shows nicely when the plan is printed
       // * declare what input it expects
       // * declare its return type
       let len = create_udf(
           "len",
           vec![DataType::Binary],
           Arc::new(DataType::Int64),
           len,
       );
   
       // at this point, we can use it or register it, depending on the use-case:
       // * if the UDF is expected to be used throughout the program in different contexts,
       //   we can register it, and call it later:
       ctx.register_udf(len.clone()); // clone is only required in this example because we show both usages
   
       // * if the UDF is expected to be used directly in the scope, `.call` it directly:
       let expr = len.call(vec![col("c")]);
   
       // get a DataFrame from the context
       let df = ctx.table("t")?;
   
       // equivalent to `'SELECT len(c) FROM t'`
       let df = df.select(vec![
           expr,
       ])?;
   
       // execute the query
       let results = df.collect().await?;
   
       // print the results
       pretty::print_batches(&results)?;
   
       Ok(())
   }
   ```
   
   ```
   +--------+
   | len(c) |
   +--------+
   | 4      |
   |        |
   |        |
   |        |
   +--------+
   ```
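   
   Since the UDF was also registered on the context, the same query should work through the SQL interface too; a fragment to append at the end of `main` above (untested sketch, assuming this version's `ExecutionContext::sql` API):
   
   ```rust
   // equivalent to the DataFrame query above, but via SQL and the registered UDF
   let df = ctx.sql("SELECT len(c) FROM t")?;
   let results = df.collect().await?;
   pretty::print_batches(&results)?;
   ```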


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org