You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@sedona.apache.org by GitBox <gi...@apache.org> on 2022/09/21 17:10:39 UTC

[GitHub] [incubator-sedona] douglasdennis opened a new pull request, #693: Dataframe api

douglasdennis opened a new pull request, #693:
URL: https://github.com/apache/incubator-sedona/pull/693

   
   ## Did you read the Contributor Guide?
   
   - Yes, I have read [Contributor Rules](https://sedona.apache.org/community/rule/) and [Contributor Development Guide](https://sedona.apache.org/community/develop/)
   
   - No, I haven't read it.
   
   ## Is this PR related to a JIRA ticket?
   
   - Yes, the URL of the assoicated JIRA ticket is https://issues.apache.org/jira/browse/SEDONA-XXX. The PR name follows the format `[SEDONA-XXX] my subject`.
   
   
   - No, this is a documentation update. The PR name follows the format `[DOCS] my subject`.
   
   
   ## What changes were proposed in this PR?
   
   
   ## How was this patch tested?
   
   
   ## Did this PR include necessary documentation updates?
   
   - Yes, I am adding a new API. I am using the [current SNAPSHOT version number](https://github.com/apache/incubator-sedona/blob/master/pom.xml#L29) in since `vX.Y.Z` format.
   - Yes, I have updated the documentation update.
   - No, this PR does not affect any public API so no need to change the docs.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@sedona.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-sedona] Kimahriman commented on pull request #693: [SEDONA-166] Support type-safe dataframe style API

Posted by GitBox <gi...@apache.org>.

Kimahriman commented on PR #693:
URL: https://github.com/apache/incubator-sedona/pull/693#issuecomment-1254925542

   > @douglasdennis Another interesting finding: since the type-safe dataframe APIs do not need to call "udf.register()" to register all functions, is it possible that, as a side effect, predicate pushdown is finally supported in Sedona?
   
   I believe the current mechanism doesn't use UDFs technically, it registers SQL functions directly which _should_ theoretically be usable by predicate pushdown (if you can figure out that mechanism in V1). That's why we can do things like join detection with pattern matching on the expressions


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@sedona.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-sedona] Imbruced commented on pull request #693: [SEDONA-166] Support type-safe dataframe style API

Posted by GitBox <gi...@apache.org>.

Imbruced commented on PR #693:
URL: https://github.com/apache/incubator-sedona/pull/693#issuecomment-1257004871

   Great improvement ! The main idea is to provide the type safe function and what I am missing the most is to validate input types and verify against None values. Also I would like to have test cases for that. WDYT ? Maybe in another PR because this is massive :) Thanks for your effort. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@sedona.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-sedona] jiayuasu commented on a diff in pull request #693: Dataframe api

Posted by GitBox <gi...@apache.org>.

jiayuasu commented on code in PR #693:
URL: https://github.com/apache/incubator-sedona/pull/693#discussion_r977198754


##########
sql/src/main/scala/org/apache/spark/sql/sedona_sql/expressions/st_aggregates.scala:
##########
@@ -0,0 +1,40 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.spark.sql.sedona_sql.expressions
+
+import org.apache.spark.sql.catalyst.expressions.Expression
+import org.apache.spark.sql.Column
+import org.apache.spark.sql.sedona_sql.expressions_udaf.{ST_Envelope_Aggr => ST_Envelope_Aggr_udaf, ST_Intersection_Aggr => ST_Intersection_Aggr_udaf, ST_Union_Aggr => ST_Union_Aggr_udaf}
+
+object st_aggregates extends DataFrameAPI {

Review Comment:
   Sedona aggregator is already using type-safe aggregator: https://github.com/apache/incubator-sedona/blob/master/sql/src/main/scala/org/apache/spark/sql/sedona_sql/expressions/AggregateFunctions.scala#L76
   
   The old UDAF-based aggregator (https://github.com/apache/incubator-sedona/blob/master/sql/src/main/scala/org/apache/spark/sql/sedona_sql/expressions_udaf/AggregateFunctions.scala#L27) is not used in Sedona for Spark 3.0 version: https://github.com/apache/incubator-sedona/blob/master/sql/src/main/scala/org/apache/sedona/sql/UDF/UdfRegistrator.scala#L44
   
   Spark 2.X support will be dropped in Sedona 1.3.0 (next release). Therefore, I don't think we need this part. What do you think?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@sedona.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-sedona] jiayuasu commented on a diff in pull request #693: [SEDONA-166] Support type-safe dataframe style API

Posted by GitBox <gi...@apache.org>.

jiayuasu commented on code in PR #693:
URL: https://github.com/apache/incubator-sedona/pull/693#discussion_r977225481


##########
sql/src/main/scala/org/apache/spark/sql/sedona_sql/expressions/st_aggregates.scala:
##########
@@ -0,0 +1,40 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.spark.sql.sedona_sql.expressions
+
+import org.apache.spark.sql.catalyst.expressions.Expression
+import org.apache.spark.sql.Column
+import org.apache.spark.sql.sedona_sql.expressions_udaf.{ST_Envelope_Aggr => ST_Envelope_Aggr_udaf, ST_Intersection_Aggr => ST_Intersection_Aggr_udaf, ST_Union_Aggr => ST_Union_Aggr_udaf}
+
+object st_aggregates extends DataFrameAPI {

Review Comment:
   UserDefinedAggregateFunction is deprecated since Spark 3.0 and Spark 3.0 suggests the type-safe aggregator: https://spark.apache.org/docs/latest/sql-ref-functions-udf-aggregate.html#examples
   
   So Sedona for Spark 3.0 uses the type-safe aggregator. Sedona for Spark 2.0 still uses UserDefinedAggregateFunction, which will be completely dropped in Sedona 1.3.0 (next release).
   
   The URL you pasted actually uses the deprecated UserDefinedAggregateFunction as the example.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@sedona.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-sedona] douglasdennis commented on a diff in pull request #693: [SEDONA-166] Support type-safe dataframe style API

Posted by GitBox <gi...@apache.org>.

douglasdennis commented on code in PR #693:
URL: https://github.com/apache/incubator-sedona/pull/693#discussion_r985199634


##########
python/tests/sql/test_dataframe_api.py:
##########
@@ -0,0 +1,1087 @@
+import math
+
+from pyspark.sql import functions as f
+
+from sedona.sql import st_functions as stf
+from sedona.sql import st_constructors as stc
+from sedona.sql import st_aggregates as sta
+from sedona.sql import st_predicates as stp
+from tests.test_base import TestBase
+
+class TestDataFrameAPI(TestBase):
+
+    def test_call_function_using_columns(self):

Review Comment:
   Alright. I gave this a shot. Let me know if it isn't what you were hoping for.
   
   Also: I don't know what the proper etiquette is on github. Do I click the "Resolve Conversation" button when I think I did it or do the folks requesting code changes do that when they are satisfied?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@sedona.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-sedona] Imbruced commented on pull request #693: [SEDONA-166] Support type-safe dataframe style API

Posted by GitBox <gi...@apache.org>.

Imbruced commented on PR #693:
URL: https://github.com/apache/incubator-sedona/pull/693#issuecomment-1257022088

   @douglasdennis Thanks ! Decorator sounds great ! It's good to have Python exception raised which can be easier handled and debugged later by users :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@sedona.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-sedona] Imbruced commented on a diff in pull request #693: [SEDONA-166] Support type-safe dataframe style API

Posted by GitBox <gi...@apache.org>.

Imbruced commented on code in PR #693:
URL: https://github.com/apache/incubator-sedona/pull/693#discussion_r979268764


##########
python/tests/sql/test_dataframe_api.py:
##########
@@ -0,0 +1,1087 @@
+import math
+
+from pyspark.sql import functions as f
+
+from sedona.sql import st_functions as stf
+from sedona.sql import st_constructors as stc
+from sedona.sql import st_aggregates as sta
+from sedona.sql import st_predicates as stp
+from tests.test_base import TestBase
+
+class TestDataFrameAPI(TestBase):
+
+    def test_call_function_using_columns(self):

Review Comment:
   Do you think parametrised tests can help to simplify the tests ? I mean https://docs.pytest.org/en/6.2.x/parametrize.html ? 



##########
python/tests/sql/test_dataframe_api.py:
##########
@@ -0,0 +1,1087 @@
+import math
+
+from pyspark.sql import functions as f
+
+from sedona.sql import st_functions as stf
+from sedona.sql import st_constructors as stc
+from sedona.sql import st_aggregates as sta
+from sedona.sql import st_predicates as stp
+from tests.test_base import TestBase
+
+class TestDataFrameAPI(TestBase):
+
+    def test_call_function_using_columns(self):

Review Comment:
   Then you can use argument list, function name to apply and also some fixtures to provide dataframe ? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@sedona.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-sedona] jiayuasu commented on pull request #693: [SEDONA-166] Support type-safe dataframe style API

Posted by GitBox <gi...@apache.org>.

jiayuasu commented on PR #693:
URL: https://github.com/apache/incubator-sedona/pull/693#issuecomment-1258894077

   @douglasdennis Please let me know once you finish this PR :-)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@sedona.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-sedona] douglasdennis commented on pull request #693: [SEDONA-166] Support type-safe dataframe style API

Posted by GitBox <gi...@apache.org>.

douglasdennis commented on PR #693:
URL: https://github.com/apache/incubator-sedona/pull/693#issuecomment-1264553605

   > @douglasdennis Is this PR ready to merge? :-)
   
   Just working on a couple of the notes now. Will be done by the end of this weekend. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@sedona.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-sedona] douglasdennis commented on pull request #693: [SEDONA-166] Support type-safe dataframe style API

Posted by GitBox <gi...@apache.org>.

douglasdennis commented on PR #693:
URL: https://github.com/apache/incubator-sedona/pull/693#issuecomment-1256974746

   @jiayuasu This should be, at least structurally, complete. I still have docs and docstrings to complete. A couple questions:
   
   1. I was planning to add a new section to this [page](https://sedona.apache.org/tutorial/sql/) and this [page](https://sedona.apache.org/tutorial/sql-python/) demonstrating how to use this API. Does that seem like a good way to go about it? Or would you prefer something else?
   2. Should I add scaladoc to the Scala code?
   3. I found sphinx style docstrings in some python code. Is that the style that I should use in Python?
   
   I believe that @Imbruced is the Python reviewer, so I'm pinging them for this as well.
   
   Any comments, suggestions, or code changes needed please let me know.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@sedona.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-sedona] Imbruced commented on a diff in pull request #693: [SEDONA-166] Support type-safe dataframe style API

Posted by GitBox <gi...@apache.org>.

Imbruced commented on code in PR #693:
URL: https://github.com/apache/incubator-sedona/pull/693#discussion_r979267953


##########
python/sedona/sql/dataframe_api.py:
##########
@@ -0,0 +1,35 @@
+from typing import Any, Tuple, Union, Iterable
+from pyspark.sql import SparkSession, Column, functions as f
+
+
+ColumnOrName = Union[Column, str]
+ColumnOrNameOrNumber = Union[Column, str, float, int]
+
+
+def _convert_argument_to_java_column(arg: Any) -> Column:
+    if isinstance(arg, Column):
+        return arg._jc
+    elif isinstance(arg, str):
+        return f.col(arg)._jc
+    elif isinstance(arg, Iterable):
+        return f.array(*[Column(x) for x in map(_convert_argument_to_java_column, arg)])._jc

Review Comment:
   Whats the advantage of using map and list comprehension in one place instead of applying two functions using list comprehension ? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@sedona.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-sedona] douglasdennis commented on a diff in pull request #693: [SEDONA-166] Support type-safe dataframe style API

Posted by GitBox <gi...@apache.org>.

douglasdennis commented on code in PR #693:
URL: https://github.com/apache/incubator-sedona/pull/693#discussion_r977229737


##########
sql/src/main/scala/org/apache/spark/sql/sedona_sql/expressions/st_aggregates.scala:
##########
@@ -0,0 +1,40 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.spark.sql.sedona_sql.expressions
+
+import org.apache.spark.sql.catalyst.expressions.Expression
+import org.apache.spark.sql.Column
+import org.apache.spark.sql.sedona_sql.expressions_udaf.{ST_Envelope_Aggr => ST_Envelope_Aggr_udaf, ST_Intersection_Aggr => ST_Intersection_Aggr_udaf, ST_Union_Aggr => ST_Union_Aggr_udaf}
+
+object st_aggregates extends DataFrameAPI {

Review Comment:
   I'm shaking my fist at databricks right now...
   
   This makes sense. I'll switch to using the type safe Aggregator classes and using the spark udaf function to allow it to take columns. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@sedona.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-sedona] douglasdennis commented on pull request #693: [SEDONA-166] Support type-safe dataframe style API

Posted by GitBox <gi...@apache.org>.

douglasdennis commented on PR #693:
URL: https://github.com/apache/incubator-sedona/pull/693#issuecomment-1254566997

   @jiayuasu Thank you :) I should also mention this is some of my first Scala I've written. So, if there is any language faux pas or better ways to accomplish what I'm doing, then let me know and I'll fix it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@sedona.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-sedona] douglasdennis commented on a diff in pull request #693: [SEDONA-166] Support type-safe dataframe style API

Posted by GitBox <gi...@apache.org>.

douglasdennis commented on code in PR #693:
URL: https://github.com/apache/incubator-sedona/pull/693#discussion_r979275270


##########
python/sedona/sql/dataframe_api.py:
##########
@@ -0,0 +1,35 @@
+from typing import Any, Tuple, Union, Iterable
+from pyspark.sql import SparkSession, Column, functions as f
+
+
+ColumnOrName = Union[Column, str]
+ColumnOrNameOrNumber = Union[Column, str, float, int]
+
+
+def _convert_argument_to_java_column(arg: Any) -> Column:
+    if isinstance(arg, Column):
+        return arg._jc
+    elif isinstance(arg, str):
+        return f.col(arg)._jc
+    elif isinstance(arg, Iterable):
+        return f.array(*[Column(x) for x in map(_convert_argument_to_java_column, arg)])._jc

Review Comment:
   No advantage. Just me being silly when I refactored. I originally wanted this to use recursion in a different way, but I wasn't feeling it so I defaulted to this out of haste :) Will refactor to use function composition. Thanks for catching that. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@sedona.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-sedona] douglasdennis commented on pull request #693: [SEDONA-166] Support type-safe dataframe style API

Posted by GitBox <gi...@apache.org>.

douglasdennis commented on PR #693:
URL: https://github.com/apache/incubator-sedona/pull/693#issuecomment-1255376265

   > Another interesting finding: since the type-safe dataframe APIs do not need to call "udf.register()" to register all functions, is it possible that, as a side effect, predicate pushdown is finally supported in Sedona?
   
   @jiayuasu It does not appear to be so. Here are some results I ran this morning using the example1.parquet in the library.
   
   Checking to make sure predicate pushdown happens with native types:
   ```scala
   val geoparquetdatalocation1: String = resourceFolder + "geoparquet/example1.parquet"
   val basicPredicateDf = sparkSession.read.format("geoparquet").load(geoparquetdatalocation1).where(col("name").equalTo("Fiji"))
   basicPredicateDf.explain()
   ```
   The plan shows a push down:
   ```
   == Physical Plan ==
   *(1) Filter (isnotnull(name#5302) AND (name#5302 = Fiji))
   +- FileScan geoparquet [pop_est#5300L,continent#5301,name#5302,iso_a3#5303,gdp_md_est#5304,geometry#5305] Batched: false, DataFilters: [isnotnull(name#5302), (name#5302 = Fiji)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:<removed>, PartitionFilters: [], PushedFilters: [IsNotNull(name), EqualTo(name,Fiji)], ReadSchema: struct<pop_est:bigint,continent:string,name:string,iso_a3:string,gdp_md_est:double,geometry:array...
   ```
   \
   \
   A simple geometry based predicate:
   ```scala
   val geoparquetdatalocation1: String = resourceFolder + "geoparquet/example1.parquet"
   val basicGeomPredicateDf = sparkSession.read.format("geoparquet").load(geoparquetdatalocation1).where(ST_GeometryType("geometry").equalTo(lit("ST_Polygon")))
   basicGeomPredicateDf.explain()
   ```
   This plan does not show a push down:
   ```
   == Physical Plan ==
   Filter (st_geometrytype(geometry#5318) = ST_Polygon)
   +- FileScan geoparquet [pop_est#5313L,continent#5314,name#5315,iso_a3#5316,gdp_md_est#5317,geometry#5318] Batched: false, DataFilters: [(st_geometrytype(geometry#5318) = ST_Polygon)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:<removed>, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<pop_est:bigint,continent:string,name:string,iso_a3:string,gdp_md_est:double,geometry:array...
   ```
   \
   \
   And just to be thorough, a more complex geometry based predicate:
   ```scala
   val geoparquetdatalocation1: String = resourceFolder + "geoparquet/example1.parquet"
   val basicGeomPredicateDf = sparkSession.read.format("geoparquet").load(geoparquetdatalocation1).where(ST_Distance("geometry", ST_Point(2, 2)) <= (50.0))
   basicGeomPredicateDf.explain()
   ```
   As expected, no push down as well:
   ```
   == Physical Plan ==
   Filter ( **org.apache.spark.sql.sedona_sql.expressions.ST_Distance**   <= 50.0)
   +- FileScan geoparquet [pop_est#5326L,continent#5327,name#5328,iso_a3#5329,gdp_md_est#5330,geometry#5331] Batched: false, DataFilters: [( **org.apache.spark.sql.sedona_sql.expressions.ST_Distance**   <= 50.0)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:<removed>, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<pop_est:bigint,continent:string,name:string,iso_a3:string,gdp_md_est:double,geometry:array...
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@sedona.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-sedona] jiayuasu commented on pull request #693: [SEDONA-166] Support type-safe dataframe style API

Posted by GitBox <gi...@apache.org>.

jiayuasu commented on PR #693:
URL: https://github.com/apache/incubator-sedona/pull/693#issuecomment-1254609739

   @douglasdennis Another interesting finding: since the type-safe dataframe APIs do not need to call "udf.register()" to register all functions, is it possible that, as a side effect, predicate pushdown is finally supported in Sedona?
   
   If you don't know what a predicate pushdown is, see this: https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/spark-sql-Optimizer-PushDownPredicate.html
   
   Since Sedona 1.3.0 will natively read GeoParquet (PR already merged to the master), Sedona user  RJ Marcus is working hard to get the predicate pushdown work on GeoParquet. See [SEDONA-156](https://issues.apache.org/jira/browse/SEDONA-156).
   
   Quoted my reply to RJ Marcus:
   
   > Pushed filter: UDF function can be pushed down as a filter. Based on my understanding, Sedona ST functions cannot be pushed down because UDFs in pure Spark SQL are blackbox to Spark catalyst unless we do something with the current Sedona ST functions.
   
   > However, Sedona implements all functions in Spark SQL Catalyst "Expressions" [1] instead of the naive UDF. This gives you the possibility to push them down to the data source (see [2]). There is an ongoing effort to enable Sedona ST functions in type-safe format which bypasses the "udf.register" step (see [3])
   
   > So, with the current Sedona GeoParquet reader, and [3], it is possible that the Pushed filter will be finally supported. You might want to check it out and confirm my wild guess.
   
   > [1] https://github.com/apache/incubator-sedona/blob/master/sql/src/main/scala/org/apache/spark/sql/sedona_sql/expressions/Constructors.scala#L45
   
   > [2] https://neapowers.com/apache-spark/native-functions-catalyst-expressions/
   
   > [3] https://github.com/apache/incubator-sedona/pull/693
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@sedona.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-sedona] Imbruced commented on a diff in pull request #693: [SEDONA-166] Support type-safe dataframe style API

Posted by GitBox <gi...@apache.org>.

Imbruced commented on code in PR #693:
URL: https://github.com/apache/incubator-sedona/pull/693#discussion_r979268015


##########
python/sedona/sql/st_constructors.py:
##########
@@ -0,0 +1,79 @@
+from functools import partial
+from typing import Optional, Union
+
+from pyspark.sql import Column
+
+from sedona.sql.dataframe_api import ColumnOrName, ColumnOrNameOrNumber, call_sedona_function
+
+
+__all__ = [
+    "ST_GeomFromGeoHash",
+    "ST_GeomFromGeoJSON",
+    "ST_GeomFromGML",
+    "ST_GeomFromKML",
+    "ST_GeomFromText",
+    "ST_GeomFromWKB",
+    "ST_GeomFromWKT",
+    "ST_LineFromText",
+    "ST_LineStringFromText",
+    "ST_Point",
+    "ST_PointFromText",
+    "ST_PolygonFromEnvelope",
+    "ST_PolygonFromText",
+]
+
+
+_call_constructor_function = partial(call_sedona_function, "st_constructors")

Review Comment:
   I like the idea of using partial :)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@sedona.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-sedona] douglasdennis commented on a diff in pull request #693: [SEDONA-166] Support type-safe dataframe style API

Posted by GitBox <gi...@apache.org>.

douglasdennis commented on code in PR #693:
URL: https://github.com/apache/incubator-sedona/pull/693#discussion_r985200070


##########
python/sedona/sql/dataframe_api.py:
##########
@@ -0,0 +1,35 @@
+from typing import Any, Tuple, Union, Iterable
+from pyspark.sql import SparkSession, Column, functions as f
+
+
+ColumnOrName = Union[Column, str]
+ColumnOrNameOrNumber = Union[Column, str, float, int]
+
+
+def _convert_argument_to_java_column(arg: Any) -> Column:
+    if isinstance(arg, Column):
+        return arg._jc
+    elif isinstance(arg, str):
+        return f.col(arg)._jc
+    elif isinstance(arg, Iterable):
+        return f.array(*[Column(x) for x in map(_convert_argument_to_java_column, arg)])._jc

Review Comment:
   This is fixed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@sedona.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-sedona] jiayuasu merged pull request #693: [SEDONA-166] Support type-safe dataframe style API

Posted by GitBox <gi...@apache.org>.

jiayuasu merged PR #693:
URL: https://github.com/apache/incubator-sedona/pull/693


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@sedona.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-sedona] douglasdennis commented on pull request #693: [SEDONA-166] Support type-safe dataframe style API

Posted by GitBox <gi...@apache.org>.

douglasdennis commented on PR #693:
URL: https://github.com/apache/incubator-sedona/pull/693#issuecomment-1264672789

   Victory!
   
   @jiayuasu  As best I know this is ready to go, assuming I have addressed the notes from @Imbruced . I can also do a follow up PR after this merges if that would be better.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@sedona.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-sedona] jiayuasu commented on pull request #693: [SEDONA-166] Support type-safe dataframe style API

Posted by GitBox <gi...@apache.org>.

jiayuasu commented on PR #693:
URL: https://github.com/apache/incubator-sedona/pull/693#issuecomment-1264706257

   This looks good to me. I will merge it for now as a few other PRs are waiting for this. If @Imbruced has any comments, @douglasdennis can make a follow-up PR to address it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@sedona.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-sedona] douglasdennis commented on pull request #693: [SEDONA-166] Support type-safe dataframe style API

Posted by GitBox <gi...@apache.org>.

douglasdennis commented on PR #693:
URL: https://github.com/apache/incubator-sedona/pull/693#issuecomment-1257018535

   > Great improvement ! The main idea is to provide the type safe function and what I am missing the most is to validate input types and verify against None values. Also I would like to have test cases for that. WDYT ? Maybe in another PR because this is massive :) Thanks for your effort. 
   
   Doh! I had meant to have input validation and I completely forgot about it. I'd like to add those in this PR just for completion. The jvm call will throw if it can't find a method that accepts the given argument types but that would be cryptic to the user. For reference, I intend to use a decorator to manage the type checking. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@sedona.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-sedona] neontty commented on pull request #693: [SEDONA-166] Support type-safe dataframe style API

Posted by GitBox <gi...@apache.org>.

neontty commented on PR #693:
URL: https://github.com/apache/incubator-sedona/pull/693#issuecomment-1255397195

   @douglasdennis this looks like an awesome PR, thank you for putting in this hard work. 
   
   I agree that this PR won't shortcut to geometry based predicate pushdowns; the issue comes with handling Expressions/Predicates correctly during the FileScan since a lot of parquet code deals with "Filter" instead of Expression or Predicate. Thanks Dennis for doing this most recent test to confirm. 
   
   I'm a little bit busy today but I'm taking a look at Jia's comments on SEDONA-156 and making a response to continue the predicate-pushdown conversation in that thread. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@sedona.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-sedona] douglasdennis commented on pull request #693: [SEDONA-166] Support type-safe dataframe style API

Posted by GitBox <gi...@apache.org>.

douglasdennis commented on PR #693:
URL: https://github.com/apache/incubator-sedona/pull/693#issuecomment-1254536996

   This is pretty embarrassing. This was meant to PR against my own fork's master to get the build action to run. While most of these changes are approaching done (on the Scala side anyways), it was still subject to change as I'm sorting out the python and Java interface. The cat is out of the bag though so please consider this a draft for the moment.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@sedona.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-sedona] jiayuasu commented on pull request #693: [SEDONA-166] Support type-safe dataframe style API

Posted by GitBox <gi...@apache.org>.

jiayuasu commented on PR #693:
URL: https://github.com/apache/incubator-sedona/pull/693#issuecomment-1254539535

   @douglasdennis Haha, no problem. This PR looks pretty nice. Keep up the good work!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@sedona.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-sedona] douglasdennis commented on a diff in pull request #693: [SEDONA-166] Support type-safe dataframe style API

Posted by GitBox <gi...@apache.org>.

douglasdennis commented on code in PR #693:
URL: https://github.com/apache/incubator-sedona/pull/693#discussion_r977222536


##########
sql/src/main/scala/org/apache/spark/sql/sedona_sql/expressions/st_aggregates.scala:
##########
@@ -0,0 +1,40 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.spark.sql.sedona_sql.expressions
+
+import org.apache.spark.sql.catalyst.expressions.Expression
+import org.apache.spark.sql.Column
+import org.apache.spark.sql.sedona_sql.expressions_udaf.{ST_Envelope_Aggr => ST_Envelope_Aggr_udaf, ST_Intersection_Aggr => ST_Intersection_Aggr_udaf, ST_Union_Aggr => ST_Union_Aggr_udaf}
+
+object st_aggregates extends DataFrameAPI {

Review Comment:
   Are you asking if the st_aggregates object is needed? If so then I think it does. This allows for something like this:
   ```
   df.groupBy().agg(ST_Union_Aggr("geometry").as("geometry"))
   ```
   If that's not what you meant then apologies.
   
   I didn't realize that UserDefinedAggregateFunction was depricated though. I was actually using this tutorial for reference to understand how this works on the Scala side: https://docs.databricks.com/spark/latest/spark-sql/udaf-scala.html
   
   It looks like from what else I've read that I'm supposed to pass an Aggregator to the udaf function, so I'll plan to do that.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@sedona.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-sedona] douglasdennis commented on a diff in pull request #693: [SEDONA-166] Support type-safe dataframe style API

Posted by GitBox <gi...@apache.org>.

douglasdennis commented on code in PR #693:
URL: https://github.com/apache/incubator-sedona/pull/693#discussion_r979275754


##########
python/tests/sql/test_dataframe_api.py:
##########
@@ -0,0 +1,1087 @@
+import math
+
+from pyspark.sql import functions as f
+
+from sedona.sql import st_functions as stf
+from sedona.sql import st_constructors as stc
+from sedona.sql import st_aggregates as sta
+from sedona.sql import st_predicates as stp
+from tests.test_base import TestBase
+
+class TestDataFrameAPI(TestBase):
+
+    def test_call_function_using_columns(self):

Review Comment:
   Ooohhhh! I have not used parameterized tests before. I'm excited to give them a shot. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@sedona.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-sedona] jiayuasu commented on pull request #693: [SEDONA-166] Support type-safe dataframe style API

Posted by GitBox <gi...@apache.org>.

jiayuasu commented on PR #693:
URL: https://github.com/apache/incubator-sedona/pull/693#issuecomment-1264545797

   @douglasdennis Is this PR ready to merge? :-)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@sedona.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org