Posted to reviews@spark.apache.org by "panbingkun (via GitHub)" <gi...@apache.org> on 2024/03/29 08:52:11 UTC

[PR] [SPARK-47649][SQL] Make the parameter `inputs` of the function `[csv|parquet|orc|json|text|xml](paths: String*)` non empty [spark]

panbingkun opened a new pull request, #45776:
URL: https://github.com/apache/spark/pull/45776

   
   ### What changes were proposed in this pull request?
   Require the `paths` parameter of `DataFrameReader#[csv|parquet|orc|json|text|xml](paths: String*)` to be non-empty when no user-specified schema is given, via a new private helper `assertNonEmptyPaths`.

   ### Why are the changes needed?
   Calling these readers with no paths and no schema currently fails late with a confusing `UNABLE_TO_INFER_SCHEMA` error. Failing fast with a clear message is more user-friendly.

   ### Does this PR introduce _any_ user-facing change?
   Yes. Without a user-specified schema, calling e.g. `spark.read.json()` with no paths now fails with "The paths cannot be empty" instead of `UNABLE_TO_INFER_SCHEMA`. With a user-specified schema, the existing behavior (an empty dataset is returned) is unchanged.

   ### How was this patch tested?
   - Add new UT.
   - Pass GA.
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No.




Re: [PR] [SPARK-47649][SQL] Make the parameter `inputs` of the function `[csv|parquet|orc|json|text|xml](paths: String*)` non empty [spark]

Posted by "panbingkun (via GitHub)" <gi...@apache.org>.
panbingkun commented on PR #45776:
URL: https://github.com/apache/spark/pull/45776#issuecomment-2033885914

   > scala> spark.read.json().show()
   
   when running
   ```
   scala> spark.read.json().show()
   ```
   
   it returns an `empty dataset`




Re: [PR] [SPARK-47649][SQL] Make the parameter `inputs` of the function `[csv|parquet|orc|json|text|xml](paths: String*)` non empty [spark]

Posted by "panbingkun (via GitHub)" <gi...@apache.org>.
panbingkun commented on code in PR #45776:
URL: https://github.com/apache/spark/pull/45776#discussion_r1546245686


##########
sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala:
##########
@@ -538,7 +539,10 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
    * @since 2.0.0
    */
   @scala.annotation.varargs
-  def csv(paths: String*): DataFrame = format("csv").load(paths : _*)
+  def csv(paths: String*): DataFrame = {

Review Comment:
   Updated. This is now supported, and a corresponding UT has been added to each type of `***Suite`.





Re: [PR] [SPARK-47649][SQL] Make the parameter `inputs` of the function `[csv|parquet|orc|json|text|xml](paths: String*)` non empty [spark]

Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.
LuciferYang commented on PR #45776:
URL: https://github.com/apache/spark/pull/45776#issuecomment-2027488989

   > I support this direction, @panbingkun .
   
   +1




Re: [PR] [SPARK-47649][SQL] Make the parameter `inputs` of the function `[csv|parquet|orc|json|text|xml](paths: String*)` non empty [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45776:
URL: https://github.com/apache/spark/pull/45776#discussion_r1546048010


##########
sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala:
##########
@@ -538,7 +539,10 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
    * @since 2.0.0
    */
   @scala.annotation.varargs
-  def csv(paths: String*): DataFrame = format("csv").load(paths : _*)
+  def csv(paths: String*): DataFrame = {

Review Comment:
   same question here.





Re: [PR] [SPARK-47649][SQL] Make the parameter `inputs` of the function `[csv|parquet|orc|json|text|xml](paths: String*)` non empty [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #45776:
URL: https://github.com/apache/spark/pull/45776#discussion_r1549230211


##########
sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala:
##########
@@ -766,6 +778,15 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
     }
   }
 
+  /**
+   * A convenient function for paths validation in APIs.
+   */
+  private def assertNonEmptyPaths(paths: String*): Unit = {
+    if (userSpecifiedSchema.isEmpty) {
+      require(paths.nonEmpty, "The paths cannot be empty")

Review Comment:
   What about SQL?





Re: [PR] [SPARK-47649][SQL] Make the parameter `inputs` of the function `[csv|parquet|orc|json|text|xml](paths: String*)` non empty [spark]

Posted by "panbingkun (via GitHub)" <gi...@apache.org>.
panbingkun commented on code in PR #45776:
URL: https://github.com/apache/spark/pull/45776#discussion_r1549250502


##########
sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala:
##########
@@ -766,6 +778,15 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
     }
   }
 
+  /**
+   * A convenient function for paths validation in APIs.
+   */
+  private def assertNonEmptyPaths(paths: String*): Unit = {
+    if (userSpecifiedSchema.isEmpty) {
+      require(paths.nonEmpty, "The paths cannot be empty")

Review Comment:
   > I am actually still not sure on this. What about `DataFrameReader.load()`? I feel like it's better to delegate this check to DataSources themselves.
   
   `DataFrameReader.load()`
   <img width="1015" alt="image" src="https://github.com/apache/spark/assets/15246973/135a7936-7cce-4c60-8c70-280e4cc7538f">
   





Re: [PR] [SPARK-47649][SQL] Make the parameter `inputs` of the function `[csv|parquet|orc|json|text|xml](paths: String*)` non empty [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #45776:
URL: https://github.com/apache/spark/pull/45776#issuecomment-2033496439

   To be sure, gentle ping once more, @HyukjinKwon , @cloud-fan , @LuciferYang , @koedlt 
   - https://github.com/apache/spark/pull/45776#discussion_r1545520694
   - https://github.com/apache/spark/pull/45776#discussion_r1545520867




Re: [PR] [SPARK-47649][SQL] Make the parameter `inputs` of the function `[csv|parquet|orc|json|text|xml](paths: String*)` non empty [spark]

Posted by "panbingkun (via GitHub)" <gi...@apache.org>.
panbingkun commented on code in PR #45776:
URL: https://github.com/apache/spark/pull/45776#discussion_r1545540101


##########
sql/core/src/test/java/test/org/apache/spark/sql/JavaDataFrameWriterV2Suite.java:
##########
@@ -26,29 +32,30 @@
 import org.apache.spark.sql.connector.catalog.InMemoryTableCatalog;
 import org.apache.spark.sql.test.TestSparkSession;
 import org.apache.spark.sql.types.StructType;
-import org.junit.jupiter.api.AfterEach;
-import org.junit.jupiter.api.BeforeEach;
-import org.junit.jupiter.api.Test;
+import org.apache.spark.util.Utils;
 
 import static org.apache.spark.sql.functions.*;
 
 public class JavaDataFrameWriterV2Suite {
   private static StructType schema = new StructType().add("s", "string");
   private SparkSession spark = null;
+  private transient String input;
 
   public Dataset<Row> df() {
-    return spark.read().schema(schema).text();
+    return spark.read().schema(schema).text(input);
   }
 
   @BeforeEach
   public void createTestTable() {
     this.spark = new TestSparkSession();
     spark.conf().set("spark.sql.catalog.testcat", InMemoryTableCatalog.class.getName());
     spark.sql("CREATE TABLE testcat.t (s string) USING foo");
+    input = Utils.createTempDir(System.getProperty("java.io.tmpdir"), "input").toString();

Review Comment:
   Very reasonable, but unfortunately this is Java code, which does not support that style (Scala named/default arguments).
   As follows:
   [screenshot: https://github.com/apache/spark/assets/15246973/ffe05290-8d96-4177-9c18-cddfa38d6470]





Re: [PR] [SPARK-47649][SQL] Make the parameter `inputs` of the function `[csv|parquet|orc|json|text|xml](paths: String*)` non empty [spark]

Posted by "panbingkun (via GitHub)" <gi...@apache.org>.
panbingkun commented on code in PR #45776:
URL: https://github.com/apache/spark/pull/45776#discussion_r1545540426


##########
sql/core/src/test/scala/org/apache/spark/sql/test/DataFrameReaderWriterSuite.scala:
##########
@@ -625,15 +626,21 @@ class DataFrameReaderWriterSuite extends QueryTest with SharedSparkSession with
     spark.createDataset(data).write.mode(SaveMode.Overwrite).text(dir)
 
     // Reader, without user specified schema
-    testRead(spark.read.textFile().toDF(), Seq.empty, textSchema)
+    withTempDir { tmpDir =>

Review Comment:
   @HyukjinKwon 
   This is a test case for `Empty Dataset`. When reading an empty directory, you will get an `Empty Dataset` (cf. `SparkSession.emptyDataset`).




Re: [PR] [SPARK-47649][SQL] Make the parameter `inputs` of the function `[csv|parquet|orc|json|text|xml](paths: String*)` non empty [spark]

Posted by "panbingkun (via GitHub)" <gi...@apache.org>.
panbingkun commented on code in PR #45776:
URL: https://github.com/apache/spark/pull/45776#discussion_r1549210594


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala:
##########
@@ -3820,6 +3820,21 @@ abstract class JsonSuite
       }
     }
   }
+

Review Comment:
   @HyukjinKwon 
   In each `***Suite`, I added a UT to verify this behavior: when the schema is specified, an `empty dataset` is returned.
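   A minimal sketch of what such a UT could look like (the test name and assertions are illustrative, not the exact code from the PR):
   
   ```scala
   test("SPARK-47649: read with a user-specified schema and no paths") {
     // Illustrative only: mirrors the behavior shown later in this thread.
     val df = spark.read.schema("value STRING").json()
     assert(df.isEmpty)
     assert(df.columns.toSeq == Seq("value"))
   }
   ```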





Re: [PR] [SPARK-47649][SQL] Make the parameter `inputs` of the function `[csv|parquet|orc|json|text|xml](paths: String*)` non empty [spark]

Posted by "panbingkun (via GitHub)" <gi...@apache.org>.
panbingkun commented on code in PR #45776:
URL: https://github.com/apache/spark/pull/45776#discussion_r1545539985


##########
sql/core/src/test/java/test/org/apache/spark/sql/JavaDataFrameWriterV2Suite.java:
##########
@@ -26,29 +32,30 @@
 import org.apache.spark.sql.connector.catalog.InMemoryTableCatalog;
 import org.apache.spark.sql.test.TestSparkSession;
 import org.apache.spark.sql.types.StructType;
-import org.junit.jupiter.api.AfterEach;
-import org.junit.jupiter.api.BeforeEach;
-import org.junit.jupiter.api.Test;
+import org.apache.spark.util.Utils;
 
 import static org.apache.spark.sql.functions.*;
 
 public class JavaDataFrameWriterV2Suite {
   private static StructType schema = new StructType().add("s", "string");
   private SparkSession spark = null;
+  private transient String input;

Review Comment:
   After careful consideration, I think it would be more appropriate to call it `input` here, since it could be either a specific file or a directory; `input` (or perhaps `inputDir`?) better matches that meaning.
   Additionally, it corresponds to another test class, `JavaDataFrameReaderWriterSuite`:
   https://github.com/apache/spark/blob/11d76c96554cc71c6a941c99222c08c76bd04bf2/sql/core/src/test/java/test/org/apache/spark/sql/JavaDataFrameReaderWriterSuite.java#L35-L36





Re: [PR] [SPARK-47649][SQL] Make the parameter `inputs` of the function `[csv|parquet|orc|json|text|xml](paths: String*)` non empty [spark]

Posted by "koedlt (via GitHub)" <gi...@apache.org>.
koedlt commented on code in PR #45776:
URL: https://github.com/apache/spark/pull/45776#discussion_r1545642868


##########
sql/core/src/test/java/test/org/apache/spark/sql/JavaDataFrameWriterV2Suite.java:
##########
@@ -26,29 +32,30 @@
 import org.apache.spark.sql.connector.catalog.InMemoryTableCatalog;
 import org.apache.spark.sql.test.TestSparkSession;
 import org.apache.spark.sql.types.StructType;
-import org.junit.jupiter.api.AfterEach;
-import org.junit.jupiter.api.BeforeEach;
-import org.junit.jupiter.api.Test;
+import org.apache.spark.util.Utils;
 
 import static org.apache.spark.sql.functions.*;
 
 public class JavaDataFrameWriterV2Suite {
   private static StructType schema = new StructType().add("s", "string");
   private SparkSession spark = null;
+  private transient String input;

Review Comment:
   Ahhh of course, that makes sense! Maybe `inputDir` is a bit more explicit, but since the `JavaDataFrameReaderWriterSuite` already has the same notation it's probably best to be consistent :+1: 





Re: [PR] [SPARK-47649][SQL] Make the parameter `inputs` of the function `[csv|parquet|orc|json|text|xml](paths: String*)` non empty [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #45776:
URL: https://github.com/apache/spark/pull/45776#discussion_r1545520694


##########
sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala:
##########
@@ -538,7 +539,10 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
    * @since 2.0.0
    */
   @scala.annotation.varargs
-  def csv(paths: String*): DataFrame = format("csv").load(paths : _*)
+  def csv(paths: String*): DataFrame = {

Review Comment:
   Does CSV/JSON work if users explicitly specify the schema? That's more of a production use case.





Re: [PR] [SPARK-47649][SQL] Make the parameter `inputs` of the function `[csv|parquet|orc|json|text|xml](paths: String*)` non empty [spark]

Posted by "panbingkun (via GitHub)" <gi...@apache.org>.
panbingkun commented on PR #45776:
URL: https://github.com/apache/spark/pull/45776#issuecomment-2037333996

   I will continue to investigate it carefully and find a compromise solution.




Re: [PR] [SPARK-47649][SQL] Make the parameter `inputs` of the function `[csv|parquet|orc|json|text|xml](paths: String*)` non empty [spark]

Posted by "panbingkun (via GitHub)" <gi...@apache.org>.
panbingkun commented on PR #45776:
URL: https://github.com/apache/spark/pull/45776#issuecomment-2033846860

   > What concerns me is when you specify the schema:
   > 
   > ```scala
   > scala> spark.read.schema("value STRING").text().show()
   > 24/04/03 15:35:10 WARN DataSource: All paths were ignored:
   > 
   > +-----+
   > |value|
   > +-----+
   > +-----+
   > 
   > 
   > scala> spark.read.schema("value STRING").csv().show()
   > 24/04/03 15:35:13 WARN DataSource: All paths were ignored:
   > 
   > +-----+
   > |value|
   > +-----+
   > +-----+
   > 
   > 
   > scala> spark.read.schema("value STRING").json().show()
   > 24/04/03 15:35:16 WARN DataSource: All paths were ignored:
   > 
   > +-----+
   > |value|
   > +-----+
   > +-----+
   > 
   > 
   > scala> spark.read.schema("value STRING").orc().show()
   > 24/04/03 15:35:19 WARN DataSource: All paths were ignored:
   > 
   > +-----+
   > |value|
   > +-----+
   > +-----+
   > 
   > 
   > scala> spark.read.schema("value STRING").parquet().show()
   > 24/04/03 15:35:21 WARN DataSource: All paths were ignored:
   > 
   > +-----+
   > |value|
   > +-----+
   > +-----+
   > ```
   > 
   > It all works with returning an empty DataFrame. Suppose that you have a logic like:
   > 
   > ```scala
   > val paths: Seq[String] = ... // Get the paths from somewhere else
   > spark.read.schema(...).csv(paths)
   > ```
   > 
   > then it will be broken.
   
   
   @HyukjinKwon 
   With the new updates, the behavior above has been restored to its `original state`, as follows:
   ```
   Spark context Web UI available at http://172.24.144.22:4040
   Spark context available as 'sc' (master = local-cluster[2, 1, 1024], app id = app-20240403155813-0000).
   Spark session available as 'spark'.
   
   scala> spark.read.schema("value STRING").text().show()
   +-----+
   |value|
   +-----+
   +-----+
   
   
   scala> spark.read.schema("value STRING").csv().show()
   +-----+
   |value|
   +-----+
   +-----+
   
   
   scala> spark.read.schema("value STRING").json().show()
   +-----+
   |value|
   +-----+
   +-----+
   
   
   scala> spark.read.schema("value STRING").orc().show()
   +-----+
   |value|
   +-----+
   +-----+
   
   
   scala> spark.read.schema("value STRING").parquet().show()
   +-----+
   |value|
   +-----+
   +-----+
   ```




Re: [PR] [SPARK-47649][SQL] Make the parameter `inputs` of the function `[csv|parquet|orc|json|text|xml](paths: String*)` non empty [spark]

Posted by "panbingkun (via GitHub)" <gi...@apache.org>.
panbingkun commented on code in PR #45776:
URL: https://github.com/apache/spark/pull/45776#discussion_r1549191065


##########
sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala:
##########
@@ -538,7 +539,10 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
    * @since 2.0.0
    */
   @scala.annotation.varargs
-  def csv(paths: String*): DataFrame = format("csv").load(paths : _*)
+  def csv(paths: String*): DataFrame = {
+    assertNonEmptyPaths(paths: _*)

Review Comment:
   @HyukjinKwon
   This check has been added.
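   Taken together with the `assertNonEmptyPaths` helper shown elsewhere in this thread, the proposed shape is roughly (a sketch assembled from the diff hunks above, not the verbatim final code):
   
   ```scala
   // Only enforce non-empty paths when schema inference would be needed.
   private def assertNonEmptyPaths(paths: String*): Unit = {
     if (userSpecifiedSchema.isEmpty) {
       require(paths.nonEmpty, "The paths cannot be empty")
     }
   }
   
   @scala.annotation.varargs
   def csv(paths: String*): DataFrame = {
     assertNonEmptyPaths(paths: _*)
     format("csv").load(paths: _*)
   }
   ```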





Re: [PR] [SPARK-47649][SQL] Make the parameter `inputs` of the function `[csv|parquet|orc|json|text|xml](paths: String*)` non empty [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #45776:
URL: https://github.com/apache/spark/pull/45776#discussion_r1545520867


##########
sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala:
##########
@@ -538,7 +539,10 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
    * @since 2.0.0
    */
   @scala.annotation.varargs
-  def csv(paths: String*): DataFrame = format("csv").load(paths : _*)
+  def csv(paths: String*): DataFrame = {
+    require(paths.nonEmpty, "The paths cannot be empty")

Review Comment:
   While I agree with the intention of pinning the behaviour down, I am concerned about behaviour changes. An empty Dataset is an existing concept, e.g., `SparkSession.emptyDataset`.
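   For context, a quick illustration of that existing concept (a minimal sketch; `spark` is assumed to be an active `SparkSession`):
   
   ```scala
   import spark.implicits._  // provides Encoder[String]
   
   // An empty Dataset is a first-class value, not an error condition.
   val empty = spark.emptyDataset[String]
   empty.show()  // prints an empty table with a single `value` column
   ```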





Re: [PR] [SPARK-47649][SQL] Make the parameter `inputs` of the function `[csv|parquet|orc|json|text|xml](paths: String*)` non empty [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #45776:
URL: https://github.com/apache/spark/pull/45776#issuecomment-2033671882

   In addition, the errors I get are:
   
   ```
   
   
   scala> spark.read.json().show()
   24/04/03 15:37:24 WARN DataSource: All paths were ignored:
   
   org.apache.spark.sql.AnalysisException: [UNABLE_TO_INFER_SCHEMA] Unable to infer schema for JSON. It must be specified manually. SQLSTATE: 42KD9
     at org.apache.spark.sql.errors.QueryCompilationErrors$.dataSchemaNotSpecifiedError(QueryCompilationErrors.scala:1581)
     at org.apache.spark.sql.execution.datasources.DataSource.$anonfun$getOrInferFileFormatSchema$12(DataSource.scala:212)
     at scala.Option.getOrElse(Option.scala:201)
     at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:212)
     at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:409)
     at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:232)
     at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:214)
     at scala.Option.getOrElse(Option.scala:201)
     at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:214)
     at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:365)
     ... 42 elided
   
   scala> spark.read.csv().show()
   24/04/03 15:37:27 WARN DataSource: All paths were ignored:
   
   org.apache.spark.sql.AnalysisException: [UNABLE_TO_INFER_SCHEMA] Unable to infer schema for CSV. It must be specified manually. SQLSTATE: 42KD9
     at org.apache.spark.sql.errors.QueryCompilationErrors$.dataSchemaNotSpecifiedError(QueryCompilationErrors.scala:1581)
     at org.apache.spark.sql.execution.datasources.DataSource.$anonfun$getOrInferFileFormatSchema$12(DataSource.scala:212)
     at scala.Option.getOrElse(Option.scala:201)
     at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:212)
     at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:409)
     at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:232)
     at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:214)
     at scala.Option.getOrElse(Option.scala:201)
     at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:214)
     at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:541)
     ... 42 elided
   
   scala> spark.read.orc().show()
   24/04/03 15:37:29 WARN DataSource: All paths were ignored:
   
   org.apache.spark.sql.AnalysisException: [UNABLE_TO_INFER_SCHEMA] Unable to infer schema for ORC. It must be specified manually. SQLSTATE: 42KD9
     at org.apache.spark.sql.errors.QueryCompilationErrors$.dataSchemaNotSpecifiedError(QueryCompilationErrors.scala:1581)
     at org.apache.spark.sql.execution.datasources.DataSource.$anonfun$getOrInferFileFormatSchema$12(DataSource.scala:212)
     at scala.Option.getOrElse(Option.scala:201)
     at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:212)
     at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:409)
     at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:232)
     at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:214)
     at scala.Option.getOrElse(Option.scala:201)
     at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:214)
     at org.apache.spark.sql.DataFrameReader.orc(DataFrameReader.scala:663)
     ... 42 elided
   
   ```
   
   and when I specify the schema, it works.




Re: [PR] [SPARK-47649][SQL] Make the parameter `inputs` of the function `[csv|parquet|orc|json|text|xml](paths: String*)` non empty [spark]

Posted by "panbingkun (via GitHub)" <gi...@apache.org>.
panbingkun commented on code in PR #45776:
URL: https://github.com/apache/spark/pull/45776#discussion_r1549265358


##########
sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala:
##########
@@ -766,6 +778,15 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
     }
   }
 
+  /**
+   * A convenient function for paths validation in APIs.
+   */
+  private def assertNonEmptyPaths(paths: String*): Unit = {
+    if (userSpecifiedSchema.isEmpty) {
+      require(paths.nonEmpty, "The paths cannot be empty")

Review Comment:
   > What about SQL?
   
   About SQL, as follows:
   <img width="613" alt="image" src="https://github.com/apache/spark/assets/15246973/8c2ed6fe-16e8-4fbf-913f-716b8ef56ed0">
   





Re: [PR] [SPARK-47649][SQL] Make the parameter `inputs` of the function `[csv|parquet|orc|json|text|xml](paths: String*)` non empty [spark]

Posted by "koedlt (via GitHub)" <gi...@apache.org>.
koedlt commented on code in PR #45776:
URL: https://github.com/apache/spark/pull/45776#discussion_r1545444043


##########
sql/core/src/test/java/test/org/apache/spark/sql/JavaDataFrameWriterV2Suite.java:
##########
@@ -26,29 +32,30 @@
 import org.apache.spark.sql.connector.catalog.InMemoryTableCatalog;
 import org.apache.spark.sql.test.TestSparkSession;
 import org.apache.spark.sql.types.StructType;
-import org.junit.jupiter.api.AfterEach;
-import org.junit.jupiter.api.BeforeEach;
-import org.junit.jupiter.api.Test;
+import org.apache.spark.util.Utils;
 
 import static org.apache.spark.sql.functions.*;
 
 public class JavaDataFrameWriterV2Suite {
   private static StructType schema = new StructType().add("s", "string");
   private SparkSession spark = null;
+  private transient String input;

Review Comment:
   Would `inputFile` be a bit more specific as a variable name here?



##########
sql/core/src/test/java/test/org/apache/spark/sql/JavaDataFrameWriterV2Suite.java:
##########
@@ -26,29 +32,30 @@
 import org.apache.spark.sql.connector.catalog.InMemoryTableCatalog;
 import org.apache.spark.sql.test.TestSparkSession;
 import org.apache.spark.sql.types.StructType;
-import org.junit.jupiter.api.AfterEach;
-import org.junit.jupiter.api.BeforeEach;
-import org.junit.jupiter.api.Test;
+import org.apache.spark.util.Utils;
 
 import static org.apache.spark.sql.functions.*;
 
 public class JavaDataFrameWriterV2Suite {
   private static StructType schema = new StructType().add("s", "string");
   private SparkSession spark = null;
+  private transient String input;
 
   public Dataset<Row> df() {
-    return spark.read().schema(schema).text();
+    return spark.read().schema(schema).text(input);
   }
 
   @BeforeEach
   public void createTestTable() {
     this.spark = new TestSparkSession();
     spark.conf().set("spark.sql.catalog.testcat", InMemoryTableCatalog.class.getName());
     spark.sql("CREATE TABLE testcat.t (s string) USING foo");
+    input = Utils.createTempDir(System.getProperty("java.io.tmpdir"), "input").toString();

Review Comment:
   `Utils.createTempDir`'s first argument defaults to `System.getProperty("java.io.tmpdir")`, so it may not be necessary to supply it here.
   
   Maybe `Utils.createTempDir(namePrefix = "input").toString();` is a bit easier to read?





Re: [PR] [SPARK-47649][SQL] Make the parameter `inputs` of the function `[csv|parquet|orc|json|text|xml](paths: String*)` non empty [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #45776:
URL: https://github.com/apache/spark/pull/45776#discussion_r1550648587


##########
sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala:
##########
@@ -766,6 +778,15 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
     }
   }
 
+  /**
+   * A convenient function for paths validation in APIs.
+   */
+  private def assertNonEmptyPaths(paths: String*): Unit = {
+    if (userSpecifiedSchema.isEmpty) {
+      require(paths.nonEmpty, "The paths cannot be empty")

Review Comment:
   Can you try the `CREATE TABLE ... USING` syntax and select from it?
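   A minimal sketch of that experiment (illustrative only; the table name and schema here are made up):
   
   ```scala
   // Hypothetical repro: a file-source table created without an explicit path,
   // then read back through SQL rather than DataFrameReader.
   spark.sql("CREATE TABLE t (s STRING) USING csv")
   spark.sql("SELECT * FROM t").show()
   ```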





Re: [PR] [SPARK-47649][SQL] Make the parameter `inputs` of the function `[csv|parquet|orc|json|text|xml](paths: String*)` non empty [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on PR #45776:
URL: https://github.com/apache/spark/pull/45776#issuecomment-2036745633

   The UNABLE_TO_INFER_SCHEMA error looks fine to me. We can improve the error message to mention that it may be because no path was given. I'm not sure we need to spend effort refining the behavior around it.




Re: [PR] [SPARK-47649][SQL] Make the parameter `inputs` of the function `[csv|parquet|orc|json|text|xml](paths: String*)` non empty [spark]

Posted by "panbingkun (via GitHub)" <gi...@apache.org>.
panbingkun commented on code in PR #45776:
URL: https://github.com/apache/spark/pull/45776#discussion_r1549238291


##########
sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala:
##########
@@ -766,6 +778,15 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
     }
   }
 
+  /**
+   * A convenient function for paths validation in APIs.
+   */
+  private def assertNonEmptyPaths(paths: String*): Unit = {
+    if (userSpecifiedSchema.isEmpty) {
+      require(paths.nonEmpty, "The paths cannot be empty")

Review Comment:
   Okay, I'll give it a try.





Re: [PR] [SPARK-47649][SQL] Make the parameter `inputs` of the function `[csv|parquet|orc|json|text|xml](paths: String*)` non empty [spark]

Posted by "panbingkun (via GitHub)" <gi...@apache.org>.
panbingkun commented on PR #45776:
URL: https://github.com/apache/spark/pull/45776#issuecomment-2026921175

   cc @cloud-fan @HyukjinKwon @dongjoon-hyun 






Re: [PR] [SPARK-47649][SQL] Make the parameter `inputs` of the function `[csv|parquet|orc|json|text|xml](paths: String*)` non empty [spark]

Posted by "koedlt (via GitHub)" <gi...@apache.org>.
koedlt commented on code in PR #45776:
URL: https://github.com/apache/spark/pull/45776#discussion_r1545643660


##########
sql/core/src/test/java/test/org/apache/spark/sql/JavaDataFrameWriterV2Suite.java:
##########
@@ -26,29 +32,30 @@
 import org.apache.spark.sql.connector.catalog.InMemoryTableCatalog;
 import org.apache.spark.sql.test.TestSparkSession;
 import org.apache.spark.sql.types.StructType;
-import org.junit.jupiter.api.AfterEach;
-import org.junit.jupiter.api.BeforeEach;
-import org.junit.jupiter.api.Test;
+import org.apache.spark.util.Utils;
 
 import static org.apache.spark.sql.functions.*;
 
 public class JavaDataFrameWriterV2Suite {
   private static StructType schema = new StructType().add("s", "string");
   private SparkSession spark = null;
+  private transient String input;
 
   public Dataset<Row> df() {
-    return spark.read().schema(schema).text();
+    return spark.read().schema(schema).text(input);
   }
 
   @BeforeEach
   public void createTestTable() {
     this.spark = new TestSparkSession();
     spark.conf().set("spark.sql.catalog.testcat", InMemoryTableCatalog.class.getName());
     spark.sql("CREATE TABLE testcat.t (s string) USING foo");
+    input = Utils.createTempDir(System.getProperty("java.io.tmpdir"), "input").toString();

Review Comment:
   Damn, of course, I wasn't looking closely :+1:





Re: [PR] [SPARK-47649][SQL] Make the parameter `inputs` of the function `[csv|parquet|orc|json|text|xml](paths: String*)` non empty [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #45776:
URL: https://github.com/apache/spark/pull/45776#discussion_r1549227681


##########
sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala:
##########
@@ -717,7 +726,10 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
    * @since 1.6.0
    */
   @scala.annotation.varargs
-  def text(paths: String*): DataFrame = format("text").load(paths : _*)
+  def text(paths: String*): DataFrame = {
+    assertNonEmptyPaths(paths: _*)
+    format("text").load(paths : _*)

Review Comment:
   ```suggestion
       format("text").load(paths: _*)
   ```





Re: [PR] [SPARK-47649][SQL] Make the parameter `inputs` of the function `[csv|parquet|orc|json|text|xml](paths: String*)` non empty [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #45776:
URL: https://github.com/apache/spark/pull/45776#discussion_r1549229781


##########
sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala:
##########
@@ -766,6 +778,15 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
     }
   }
 
+  /**
+   * A convenient function for paths validation in APIs.
+   */
+  private def assertNonEmptyPaths(paths: String*): Unit = {
+    if (userSpecifiedSchema.isEmpty) {
+      require(paths.nonEmpty, "The paths cannot be empty")

Review Comment:
   I am actually still not sure on this. What about `DataFrameReader.load()`? I feel like it's better to delegate this check to DataSources themselves.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org