Posted to issues@spark.apache.org by "Jeremy Bloom (Jira)" <ji...@apache.org> on 2020/06/10 23:15:00 UTC

[jira] [Commented] (SPARK-25186) Stabilize Data Source V2 API

    [ https://issues.apache.org/jira/browse/SPARK-25186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17132780#comment-17132780 ] 

Jeremy Bloom commented on SPARK-25186:
--------------------------------------

I'm not sure whether I'm in the right place, but I want to discuss my experience using the DataSource V2 API (using Spark 2.4.5). To be clear, I'm not a Spark contributor, but I am using Spark in several applications. If this is not the appropriate place for my comments, please direct me to the right place.

My current project is part of an open source program called COIN-OR (I can provide links if you like). Among other things, we are building a data exchange standard for mathematical optimization solvers, called MOSDEX. I am using Spark SQL as the backbone of this application. I work mostly in Java, some in Python, and not at all in Scala.

I came to this point because a critical link in my application is to load data from JSON into a Spark Dataset. For various reasons, the standard JSON reader in Spark is not adequate for my purposes. I want to use a Java Stream to load the data into a Spark dataframe. It turns out that this is remarkably hard to do.

SparkSession provides a createDataFrame method whose argument is a Java List. However, because I am dealing with potentially very large data sets, I do not want to create a list in memory. My first attempt was to wrap the stream in a "virtual list" that fulfilled most of the list "contract" but did not actually materialize a list object. The hang-up with this approach is that the elements of the stream cannot be counted unless and until the stream has been read, so the size of the list is indeterminate until that happens. Using the virtual list in the createDataFrame method sometimes succeeds but fails at other times, whenever some component of Spark needs to call the size method on the list.
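
Roughly, the idea looked something like this (a simplified sketch, not my actual code; the two-column schema and the class names are placeholders):

    import java.util.AbstractList;
    import java.util.Iterator;
    import java.util.stream.Stream;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.RowFactory;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.StructType;

    // A "virtual list" over a Java Stream: iteration is lazy, but size()
    // cannot be answered without consuming the stream first.
    final class StreamBackedList extends AbstractList<Row> {
        private final Stream<Row> source;
        StreamBackedList(Stream<Row> source) { this.source = source; }

        @Override public Iterator<Row> iterator() { return source.iterator(); }

        @Override public Row get(int index) {
            throw new UnsupportedOperationException("no random access over a stream");
        }
        @Override public int size() {
            // This is the sticking point: some Spark code paths call size().
            throw new UnsupportedOperationException("size unknown until the stream is read");
        }
    }

    // Caller side (inside some method): works when Spark only iterates,
    // fails whenever it asks the list for its size.
    SparkSession spark = SparkSession.builder().appName("sketch").master("local[*]").getOrCreate();
    StructType schema = new StructType()
            .add("name", DataTypes.StringType)
            .add("value", DataTypes.DoubleType);
    Stream<Row> rows = Stream.of(RowFactory.create("x1", 1.0), RowFactory.create("x2", 2.0));
    Dataset<Row> df = spark.createDataFrame(new StreamBackedList(rows), schema);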

I raised this question on Stack Overflow, and someone suggested using DataSource V2. There isn't much documentation out there, but I did find a couple of examples, and based on them I was able to build a reader and a writer for Java Streams. However, a number of issues arose that forced me to use some "unsavory" tricks in my code that I am uncomfortable with (they "smell" in Fowler's sense). Here are some of the issues I found.

1) The DataSource V2 API still relies on the calling sequences of DataFrameReader and DataFrameWriter from V1. The reader is instantiated by the SparkSession read method, while the writer is instantiated by the Dataset write method. The separation is confusing. Why not use a static factory method on Dataset to instantiate the reader as well?
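
To illustrate the separation, in 2.4.x the two entry points look like this (the class names are just placeholders for my own source and sink):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SaveMode;

    // given an existing SparkSession named spark

    // Reading hangs off the SparkSession ...
    Dataset<Row> df = spark.read()
            .format("com.example.mosdex.StreamSource")   // placeholder class name
            .load();

    // ... while writing hangs off the Dataset itself.
    df.write()
            .format("com.example.mosdex.StreamSink")     // placeholder class name
            .mode(SaveMode.Append)
            .save();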

2) The actual calls to the reader and writer are contained in the DataFrameReader and DataFrameWriter format methods. Using "format" for the name of this method is misleading, but once you realize what it is doing, it makes even less sense. The parameter of the format method is a string, which isn't documented, but which turns out to be a class name. The format method then creates an instance of that class using reflection. Use of reflection in this circumstance appears to be unnecessary and restrictive. Why not make the parameter an instance of an appropriate interface, such as ReadSupport or WriteSupport?
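
Concretely, the string handed to format turns out to be the fully qualified name of a class like the following, which Spark then instantiates through its no-argument constructor (names are placeholders; this is the 2.4.x ReadSupport shape as I understand it):

    package com.example.mosdex;   // placeholder package

    import org.apache.spark.sql.sources.v2.DataSourceOptions;
    import org.apache.spark.sql.sources.v2.DataSourceV2;
    import org.apache.spark.sql.sources.v2.ReadSupport;
    import org.apache.spark.sql.sources.v2.reader.DataSourceReader;

    public class StreamSource implements DataSourceV2, ReadSupport {
        // Spark calls the implicit no-arg constructor via reflection;
        // there is no way to hand it a Stream or a schema here.
        @Override
        public DataSourceReader createReader(DataSourceOptions options) {
            return new StreamSourceReader();   // reader sketched under point 3 below
        }
    }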

3) Using reflection means that the reader or writer will be created by a parameterless constructor. Thus, it is not possible to pass any information to the reader or writer. In my case, I want to pass an instance of a Java Stream and a Spark schema to the reader, and a Spark dataset to the writer. The only way I have found to do that is to create static fields in the reader and writer classes and set them before calling the format method that creates the reader and writer objects. That is what I mean by "unsavory" programming.
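
Schematically, the trick looks like this on the reader side; the writer side is analogous, and I have left out the partition planning because it isn't the point here (again, the names are placeholders):

    import java.util.List;
    import java.util.stream.Stream;

    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.catalyst.InternalRow;
    import org.apache.spark.sql.sources.v2.reader.DataSourceReader;
    import org.apache.spark.sql.sources.v2.reader.InputPartition;
    import org.apache.spark.sql.types.StructType;

    public class StreamSourceReader implements DataSourceReader {

        // The "unsavory" part: static fields are the only channel into an
        // object that Spark creates reflectively with no arguments.
        public static Stream<Row> sourceStream;
        public static StructType sourceSchema;

        @Override
        public StructType readSchema() { return sourceSchema; }

        @Override
        public List<InputPartition<InternalRow>> planInputPartitions() {
            // One partition that drains sourceStream; omitted in this sketch.
            throw new UnsupportedOperationException("partitioning omitted");
        }
    }

    // Caller side: the statics have to be set *before* format() is called.
    StreamSourceReader.sourceSchema = schema;
    StreamSourceReader.sourceStream = rows;
    Dataset<Row> df = spark.read().format("com.example.mosdex.StreamSource").load();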

4) There are other, less critical issues that are still annoying. Links to external files in the option method are undocumented strings, but they look like paths. Since the Path class has been part of Java for a long time, why not use it? And when reading from or writing to a file, why not use the Java InputStream and OutputStream classes, which have also been around forever?
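
For example, today the location travels as an anonymous string key and comes back out of DataSourceOptions as another string; something typed would state the contract much more clearly (the key and path below are just illustrative):

    // Caller side: the location is just an undocumented string key.
    Dataset<Row> df = spark.read()
            .format("com.example.mosdex.StreamSource")
            .option("path", "/data/mosdex/input.json")   // illustrative key and path
            .load();

    // Inside the source it comes back as another string via DataSourceOptions:
    //     java.util.Optional<String> location = options.get("path");
    // whereas a java.nio.file.Path (or an InputStream/OutputStream for the
    // actual I/O) would make the contract explicit:
    java.nio.file.Path input = java.nio.file.Paths.get("/data/mosdex/input.json");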

I realize that programming style in Scala may differ from Java, but I can't believe that the design of DataSource V2 requires these contortions even in Scala. Given the power that is available in Spark, it seems to me that basic input and output should be much better structured than they are.

Thanks for your attention to this matter.

> Stabilize Data Source V2 API 
> -----------------------------
>
>                 Key: SPARK-25186
>                 URL: https://issues.apache.org/jira/browse/SPARK-25186
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: Wenchen Fan
>            Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org