
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/08/18 23:19:47 UTC

[GitHub] [arrow-datafusion] mmuru opened a new issue #906: [Python]: register custom datasource

mmuru opened a new issue #906:
URL: https://github.com/apache/arrow-datafusion/issues/906


   On the Python side, how do I register a custom datasource? The register_table method of ExecutionContext is not available. In my use case, I read a Delta table, register it as a table with DataFusion, and run SQL queries on it. Thanks for your help.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] houqp commented on issue #906: [Python]: register custom datasource

Posted by GitBox <gi...@apache.org>.

houqp commented on issue #906:
URL: https://github.com/apache/arrow-datafusion/issues/906#issuecomment-913787884


   Yeah, I would expect dataframes to be unnamed. It seems like a `register_record_batches` or `register_table` method would be more idiomatic?



[GitHub] [arrow-datafusion] mmuru commented on issue #906: [Python]: register custom datasource

Posted by GitBox <gi...@apache.org>.

mmuru commented on issue #906:
URL: https://github.com/apache/arrow-datafusion/issues/906#issuecomment-913382156


   @houqp: I've created a dataframe from in-memory data. Is there a way to create a view from it and run a SQL query against that view?



[GitHub] [arrow-datafusion] jorgecarleitao commented on issue #906: [Python]: register custom datasource

Posted by GitBox <gi...@apache.org>.

jorgecarleitao commented on issue #906:
URL: https://github.com/apache/arrow-datafusion/issues/906#issuecomment-913783742


   > I think we would need to expose the memtable interface to the python binding first for sql support.
   
   It actually uses the memtable API, but assigns the table a random name. One way forward is to add an optional parameter to this function so that it supports both named and unnamed versions. Usually for DataFrames we do not care about the name, but for SQL we must name the table.
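
A minimal sketch of the optional-name idea described above. `ToyContext` and its `tables` dictionary are hypothetical stand-ins written only for illustration; they are not the DataFusion Python API, and the real binding may well look different.

```python
import uuid
from typing import Dict, List, Optional


class ToyContext:
    """Hypothetical stand-in for an execution context (not the DataFusion API)."""

    def __init__(self) -> None:
        # Maps each registered table name to its in-memory rows.
        self.tables: Dict[str, List[dict]] = {}

    def create_dataframe(self, rows: List[dict], name: Optional[str] = None) -> str:
        """Register rows as a table and return the name they were registered under."""
        if name is None:
            # Unnamed version: fall back to a random identifier.
            name = "t" + uuid.uuid4().hex
        if name in self.tables:
            raise ValueError(f"table {name!r} is already registered")
        self.tables[name] = rows
        return name


ctx = ToyContext()
rows = [{"a": 1, "b": 4}, {"a": 2, "b": 5}]
anonymous = ctx.create_dataframe(rows)        # DataFrame-style: the name is irrelevant
named = ctx.create_dataframe(rows, name="t")  # SQL-style: the caller chooses the name
```

With a single entry point like this, DataFrame users never see a name, while SQL users can pass one and then refer to it in a query.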



[GitHub] [arrow-datafusion] houqp commented on issue #906: [Python]: register custom datasource

Posted by GitBox <gi...@apache.org>.

houqp commented on issue #906:
URL: https://github.com/apache/arrow-datafusion/issues/906#issuecomment-913783040


   I think we would need to expose the memtable interface to the python binding first for sql support.



[GitHub] [arrow-datafusion] mmuru commented on issue #906: [Python]: register custom datasource

Posted by GitBox <gi...@apache.org>.

mmuru commented on issue #906:
URL: https://github.com/apache/arrow-datafusion/issues/906#issuecomment-913692560


   @jorgecarleitao: Yes, I checked the Python tests but did not find it. The current Python tests only cover register_parquet/csv/udf, after which a SQL query can be run. We need something like ctx.register_table("t", df) for in-memory datasources. Could you provide sample code or point me to a reference? Thanks.



[GitHub] [arrow-datafusion] houqp commented on issue #906: [Python]: register custom datasource

Posted by GitBox <gi...@apache.org>.

houqp commented on issue #906:
URL: https://github.com/apache/arrow-datafusion/issues/906#issuecomment-901653839


   It looks like we need to extend the python binding to expose this method. PR welcome :)



[GitHub] [arrow-datafusion] houqp edited a comment on issue #906: [Python]: register custom datasource

Posted by GitBox <gi...@apache.org>.

houqp edited a comment on issue #906:
URL: https://github.com/apache/arrow-datafusion/issues/906#issuecomment-913794437


   Ha, I wasn't suggesting renaming `create_dataframe` to a name that has "table" in it, but rather creating a new method :) I am also heavily influenced by Spark's API, so I think there is value in supporting both audiences, as you mentioned, especially considering that the effort on our end is very low.



[GitHub] [arrow-datafusion] houqp commented on issue #906: [Python]: register custom datasource

Posted by GitBox <gi...@apache.org>.

houqp commented on issue #906:
URL: https://github.com/apache/arrow-datafusion/issues/906#issuecomment-913794437


   Ha, I wasn't suggesting renaming `create_dataframe` to a name that has "table" in it :) I am also heavily influenced by Spark's API, so I think there is value in supporting both audiences, as you mentioned, especially considering that the effort on our end is very low.



[GitHub] [arrow-datafusion] jorgecarleitao commented on issue #906: [Python]: register custom datasource

Posted by GitBox <gi...@apache.org>.

jorgecarleitao commented on issue #906:
URL: https://github.com/apache/arrow-datafusion/issues/906#issuecomment-913791972


   It was named after Spark's [createDataFrame](https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.SparkSession.createDataFrame.html#pyspark-sql-sparksession-createdataframe), adapted for Python, which does not use camel case for function names, but I was more focused on making UDFs and UDAFs work at zero copy at that point.
   
   I did not use "table" because there is no semantic difference between a "table" and a "dataframe", and "dataframe" ended up being the de facto way of expressing "a programmatic Excel sheet" in Python, R, etc.
   
   I usually think of Python for ETL because the DataFrame API allows a more idiomatic way of expressing "chunks of SQL", testing those chunks, etc., and is less prone to SQL injection than the typical `"SELECT * FROM {}".format(table_name)`. That is why I placed the DataFrame API as the core API for managing tables and left the SQL _in Python_ for expressions, like Pandas does.
   
   I guess it depends on what the target audience is, and so maybe both?
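
To make the SQL injection concern above concrete, here is a small sketch in plain Python. The table names, the `ALLOWED_TABLES` allow-list, and both helper functions are hypothetical and exist only for illustration; no query is executed and no DataFusion call is involved.

```python
ALLOWED_TABLES = {"orders", "customers"}  # hypothetical allow-list of known tables


def unsafe_query(table_name: str) -> str:
    # The pattern quoted in the comment: the name is pasted into the SQL text verbatim.
    return "SELECT * FROM {}".format(table_name)


def checked_query(table_name: str) -> str:
    # Refuse anything that is not a known table name before building the SQL text.
    if table_name not in ALLOWED_TABLES:
        raise ValueError(f"unexpected table name: {table_name!r}")
    return "SELECT * FROM {}".format(table_name)


malicious = "orders; DROP TABLE orders"
print(unsafe_query(malicious))   # the injected statement ends up in the query text
print(checked_query("orders"))   # a known name passes through unchanged
```

A DataFrame-style API avoids the problem from the other direction: the table is a Python object rather than a string spliced into SQL, so there is no query text to tamper with.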



[GitHub] [arrow-datafusion] jorgecarleitao commented on issue #906: [Python]: register custom datasource

Posted by GitBox <gi...@apache.org>.

jorgecarleitao commented on issue #906:
URL: https://github.com/apache/arrow-datafusion/issues/906#issuecomment-913388536


   Yes, could you check the tests? We do that there to test the Python API. :)



[GitHub] [arrow-datafusion] jorgecarleitao commented on issue #906: [Python]: register custom datasource

Posted by GitBox <gi...@apache.org>.

jorgecarleitao commented on issue #906:
URL: https://github.com/apache/arrow-datafusion/issues/906#issuecomment-913707277


   Sorry, it was in the README, but you are right, it only works for the DataFrame API:
   
   ```python
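   import datafusion
   import pyarrow

   ctx = datafusion.ExecutionContext()  # context class as named in this thread
   # create a RecordBatch and a new DataFrame from it
   batch = pyarrow.RecordBatch.from_arrays(
       [pyarrow.array([1, 2, 3]), pyarrow.array([4, 5, 6])],
       names=["a", "b"],
   )
   df = ctx.create_dataframe([[batch]])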
   ```
   
   Sorry for the noise. :/



[GitHub] [arrow-datafusion] mmuru commented on issue #906: [Python]: register custom datasource

Posted by GitBox <gi...@apache.org>.

mmuru commented on issue #906:
URL: https://github.com/apache/arrow-datafusion/issues/906#issuecomment-914023275


   @houqp & @jorgecarleitao: Thank you both. I added a register_memtable function in context.rs and am able to run SQL queries on it. I will submit my PR tomorrow.



[GitHub] [arrow-datafusion] houqp closed issue #906: [Python]: register custom datasource

Posted by GitBox <gi...@apache.org>.

houqp closed issue #906:
URL: https://github.com/apache/arrow-datafusion/issues/906


   

