Posted to dev@sedona.apache.org by "Jia Yu (Jira)" <ji...@apache.org> on 2022/09/15 01:50:00 UTC

[jira] [Commented] (SEDONA-166) Provide DataFrame Style API

    [ https://issues.apache.org/jira/browse/SEDONA-166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17605059#comment-17605059 ] 

Jia Yu commented on SEDONA-166:
-------------------------------

[~imbruced] [~dougdennis] 

Doug (and CCing Pawel, the main contributor of Sedona Python), do you think there is a way that we can achieve this in both Scala/Java and Python?
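
One possible route (a sketch under assumptions, not a worked implementation): mirror how pyspark.sql.functions delegates to the JVM, so the Scala/Java side defines Column functions once and Python wraps them. The JVM bridge is stubbed below so the snippet runs standalone; a real module would go through SparkContext._jvm and _to_java_column, and the function names shown are illustrative, not a committed API.

{code:python}
# Sketch of a Python st_functions module shaped like pyspark.sql.functions.
# _invoke is a stub: the real version would look up the named function in a
# Sedona JVM-side registry (exact path to be decided) and wrap the returned
# Java Column, roughly:
#   jcols = [_to_java_column(c) for c in cols]
#   return Column(getattr(sc._jvm.<sedona registry>, name)(*jcols))
# Here it just composes the equivalent SQL expression string so the sketch
# runs without a SparkSession.

def _invoke(name, *cols):
    # Stubbed JVM bridge: compose the SQL call the JVM would build.
    return f"{name}({', '.join(cols)})"

def st_point(x, y):
    """Column-style wrapper around the existing SQL function ST_Point."""
    return _invoke("ST_Point", x, y)

print(st_point("p[0]", "p[1]"))  # -> ST_Point(p[0], p[1])
{code}

Because the wrappers are thin delegations, the same Scala-side Column functions would back both the Scala/Java API and the Python API, which is how Spark itself keeps the two in sync.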

> Provide DataFrame Style API
> ---------------------------
>
>                 Key: SEDONA-166
>                 URL: https://issues.apache.org/jira/browse/SEDONA-166
>             Project: Apache Sedona
>          Issue Type: New Feature
>            Reporter: Doug Dennis
>            Priority: Major
>
> Spark provides an API to operate on Column types. Especially in Python, this API is by far the most common pattern I have seen used when developing Spark applications. Currently, Sedona provides only the SQL API, which requires either generating a temporary view and using the sql method, using the expr function, or using the selectExpr method. There is no performance loss, but it does disrupt the flow of applications written with Sedona and makes certain tasks tricky to accomplish.
> As an example, consider using a Sedona function inside a transform call to generate geometries from a list of coordinates. Assume the variable spark is a SparkSession. Here is how it can be accomplished today (I omit the expr version since it is nearly identical to selectExpr):
> {code:python}
> df = spark.sql("SELECT array(array(0.0,0.0),array(1.1,2.2)) AS points_list")
> # generate a temp view and use the sql method
> df.createTempView("tbl")
> spark.sql("SELECT transform(points_list, p -> ST_Point(p[0], p[1])) AS points_list FROM tbl")
> # selectExpr
> df.selectExpr("transform(points_list, p -> ST_Point(p[0], p[1])) AS points_list")
> {code}
> I propose implementing a similar API style to Spark that works with Columns. This would allow for something like this:
> {code:python}
> import pyspark.sql.functions as f
> import sedona.sql.st_functions as stf
> df.select(f.transform(f.col("points_list"), lambda x: stf.st_point(x[0], x[1])))
> {code}
> I believe the way Spark implements this functionality can be mirrored to accomplish this.
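> For illustration only, the composition above can be modeled without Spark at all, since each Column-style function just builds the same expression tree that selectExpr would parse. Every name below is a toy stand-in, not the eventual Sedona API:
> {code:python}
> # Toy model (no Spark required) of the proposed Column-style composition.
> class Col:
>     """Minimal stand-in for pyspark.sql.Column: wraps an expression string."""
>     def __init__(self, expr):
>         self.expr = expr
>     def __getitem__(self, i):
>         return Col(f"{self.expr}[{i}]")
>
> def col(name):
>     return Col(name)
>
> def st_point(x, y):
>     # Delegates to the existing SQL function by name.
>     return Col(f"ST_Point({x.expr}, {y.expr})")
>
> def transform(arr, fn):
>     # Mirrors pyspark.sql.functions.transform: bind a lambda variable,
>     # apply the Python callable, and emit the SQL lambda syntax.
>     p = Col("p")
>     return Col(f"transform({arr.expr}, p -> {fn(p).expr})")
>
> result = transform(col("points_list"), lambda x: st_point(x[0], x[1]))
> print(result.expr)  # -> transform(points_list, p -> ST_Point(p[0], p[1]))
> {code}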



--
This message was sent by Atlassian Jira
(v8.20.10#820010)