Posted to issues@spark.apache.org by "Apache Spark (JIRA)" <ji...@apache.org> on 2015/04/21 07:33:58 UTC

[jira] [Assigned] (SPARK-7025) Create a Java-friendly input source API

     [ https://issues.apache.org/jira/browse/SPARK-7025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-7025:
-----------------------------------

    Assignee: Apache Spark  (was: Reynold Xin)

> Create a Java-friendly input source API
> ---------------------------------------
>
>                 Key: SPARK-7025
>                 URL: https://issues.apache.org/jira/browse/SPARK-7025
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>            Reporter: Reynold Xin
>            Assignee: Apache Spark
>
> The goal of this ticket is to create a simple input source API that we can maintain and support long term.
> Spark currently has two de facto input source APIs:
> 1. RDD API
> 2. Hadoop MapReduce InputFormat API
> Neither of the above is ideal:
> 1. RDD: It is hard for Java developers to implement RDD, given the implicit class tags. In addition, the RDD API depends on Scala's runtime library, which does not preserve binary compatibility across Scala versions. If a developer chooses Java to implement an input source, it would be great if that input source could remain binary compatible for years to come.
> 2. Hadoop InputFormat: The Hadoop InputFormat API is overly restrictive. For example, it forces key-value semantics and does not support running arbitrary code on the driver side (broadcast variables are one example of why driver-side code is useful). In addition, it is somewhat awkward to tell developers that they must first learn the Hadoop MapReduce API in order to implement an input source for Spark.
> So here's the proposal: an InputSource is described by:
> * an array of InputPartition that specifies the data partitioning
> * a RecordReader that specifies how the data in each partition can be read
> This interface would be similar to Hadoop's InputFormat, except that there is no explicit key/value separation.
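
A minimal Java sketch of what these interfaces could look like, purely to illustrate the proposal. The type names InputSource, InputPartition, and RecordReader come from the ticket; the method names and signatures below are assumptions, not a finalized Spark API.

import java.io.Closeable;
import java.io.IOException;
import java.io.Serializable;

// Hypothetical sketch only: method names and signatures are assumptions,
// not part of any released Spark API.

/** Describes one partition of the input data; instances are created on the driver. */
interface InputPartition extends Serializable {
  /** Optional locality hints, e.g. hostnames where the partition's data lives. */
  String[] preferredLocations();
}

/** Reads the records of a single partition on an executor. */
interface RecordReader<T> extends Closeable {
  /** Advances to the next record; returns false once the partition is exhausted. */
  boolean next() throws IOException;

  /** Returns the current record. Unlike Hadoop's InputFormat, there is no key/value split. */
  T get();
}

/** Entry point a developer implements to expose a new input source. */
interface InputSource<T> extends Serializable {
  /** Runs on the driver and may execute arbitrary setup code (e.g. preparing broadcast data). */
  InputPartition[] getPartitions();

  /** Called on executors to read a single partition. */
  RecordReader<T> createRecordReader(InputPartition partition) throws IOException;
}

An executor would then drive a reader with a loop along the lines of "while (reader.next()) { process(reader.get()); }", mirroring Hadoop's RecordReader contract but yielding a single record type instead of a key-value pair.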


