Posted to issues@spark.apache.org by "Steve Loughran (JIRA)" <ji...@apache.org> on 2017/05/02 14:22:04 UTC

[jira] [Commented] (SPARK-19582) DataFrameReader conceptually inadequate

    [ https://issues.apache.org/jira/browse/SPARK-19582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15992985#comment-15992985 ] 

Steve Loughran commented on SPARK-19582:
----------------------------------------

All Spark is doing is taking a URL to the data, mapping its scheme to a filesystem implementation classname, and expecting that class to implement the methods of `org.apache.hadoop.fs.FileSystem` so as to provide FS-like behaviour.
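As a sketch of what that looks like under the hood (the bucket and path here are made up), this is roughly all Spark asks Hadoop to do:

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hadoop maps the URL scheme to an implementation class, via either an
// fs.<scheme>.impl entry or the ServiceLoader metadata in the filesystem JAR.
val conf = new Configuration()
conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")

// Spark never talks to the store directly; it asks for the FileSystem
// matching the URI and calls the usual FS methods on it.
val fs = FileSystem.get(new java.net.URI("s3a://some-bucket/"), conf)
val status = fs.getFileStatus(new Path("s3a://some-bucket/data/part-00000"))
{code}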

Given that minio is nominally an S3 clone, it sounds like the problem here is setting up the Hadoop S3A client to bind to it. I'd isolate that to the Hadoop code before going near Spark, test on Hadoop 2.8, and file bugs against Hadoop and/or minio if there are problems. AFAIK, nobody has run the Hadoop S3A [tests|https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/testing.md] against minio; doing that and documenting how to configure the client would be a welcome contribution. If minio is 100% S3 compatible (v2/v4 auth + multipart PUT; encryption optional), then the S3A client should work with it...it could work as another integration test for minio.
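For reference, the usual knobs for binding S3A to a third-party endpoint are below (the minio address and credentials are placeholders); if the client still won't bind with these set, that's the bug to isolate:

{code}
import org.apache.hadoop.conf.Configuration

val conf = new Configuration()
// placeholder endpoint and credentials for whatever minio exposes
conf.set("fs.s3a.endpoint", "minio.example.com:9000")
conf.set("fs.s3a.access.key", "MINIO_ACCESS_KEY")
conf.set("fs.s3a.secret.key", "MINIO_SECRET_KEY")
// most S3 clones serve buckets path-style rather than as virtual hosts;
// fs.s3a.path.style.access needs Hadoop 2.8+
conf.set("fs.s3a.path.style.access", "true")
conf.set("fs.s3a.connection.ssl.enabled", "false")
{code}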

> DataFrameReader conceptually inadequate
> ---------------------------------------
>
>                 Key: SPARK-19582
>                 URL: https://issues.apache.org/jira/browse/SPARK-19582
>             Project: Spark
>          Issue Type: Bug
>          Components: Java API
>    Affects Versions: 2.1.0
>            Reporter: James Q. Arnold
>
> DataFrameReader assumes it "understands" all data sources (local file system, object stores, JDBC, ...).  This seems limiting in the long term, imposing both development costs for accepting new sources and dependency issues for existing ones (how to coordinate the XX jar Spark uses internally vs. the XX jar used by the application).  Unless I have missed how this can be done currently, an application with an unsupported data source cannot create the required RDD for distribution.
> I recommend at least providing a text API for supplying data.  Let the application provide the data itself as a String (or char[] or ...), not a path but the actual contents.  Alternatively, provide interfaces or abstract classes that the application could implement to handle external data sources itself, without forcing all that complication into the Spark implementation.
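> Concretely, the shape I have in mind is something like the following (the fromString method and the helper are hypothetical, not existing API):
>
> {code}
> // hand Spark the data itself rather than a path to it
> val contents: String = fetchObjectContents() // application-side I/O; hypothetical helper
> val df = spark.read.fromString(contents)     // fromString is the proposal, not existing API
> {code}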
> I don't have any code to submit, but JIRA seemed like the most appropriate place to raise the issue.
> Finally, if I have overlooked how this can be done with the current API, a new example would be appreciated.
> Additional detail...
> We use the minio object store, which provides an API compatible with AWS S3.  A few configuration/parameter values differ for minio, but one can use the AWS library in the application to connect to the minio server.
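> That application-side connection looks roughly like this (assuming the 1.11.x AWS SDK for Java; the endpoint and credentials are placeholders):
>
> {code}
> import com.amazonaws.auth.{AWSStaticCredentialsProvider, BasicAWSCredentials}
> import com.amazonaws.client.builder.AwsClientBuilder.EndpointConfiguration
> import com.amazonaws.services.s3.AmazonS3ClientBuilder
>
> // point the stock AWS client at the minio server instead of AWS itself
> val s3 = AmazonS3ClientBuilder.standard()
>   .withEndpointConfiguration(
>     new EndpointConfiguration("http://minio.example.com:9000", "us-east-1"))
>   .withPathStyleAccessEnabled(true)
>   .withCredentials(new AWSStaticCredentialsProvider(
>     new BasicAWSCredentials("MINIO_ACCESS_KEY", "MINIO_SECRET_KEY")))
>   .build()
>
> val contents = s3.getObjectAsString("some-bucket", "some/object.json")
> {code}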
> When trying to use minio objects through Spark, the s3://xxx paths are intercepted by Spark and handed to Hadoop.  So far, I have been unable to find the right combination of configuration values and parameters to "convince" Hadoop to send the right information to minio.  If I could read the minio object in the application and then hand the object contents directly to Spark, I could bypass Hadoop and solve the problem.  Unfortunately, the underlying Spark design prevents that.  So I see two problems:
> -  Spark seems to have taken on the responsibility of "knowing" the API details of all data sources.  This seems iffy (and is the root of my current problem): in the long run, it seems unwise to assume that Spark should understand every possible path name, protocol, etc.  Moreover, passing S3 paths to Hadoop seems a little odd (why not go directly to AWS, for example?).  This particular confusion about S3 shows the kind of difficulties that are bound to occur.
> -  Second, Spark appears not to have a way to bypass the path-name interpretation.  At the least, Spark could provide a text/blob interface, letting the application supply the data object and avoid path interpretation inside Spark.  Alternatively, Spark could accept a reader/stream/... from which to build the object, again letting the application provide the implementation of the object input.
> As I mentioned above, I might be missing something in the API that lets us work around the problem.  I'll keep looking, but the API as currently structured seems too limiting.


