Posted to dev@spark.apache.org by 诺铁 <no...@gmail.com> on 2015/06/25 08:25:36 UTC

how to implement my own datasource?

hi,

I can't find documentation about the data source API or how to implement a custom
data source. Any hint is appreciated. Thanks.

Re: how to implement my own datasource?

Posted by 诺铁 <no...@gmail.com>.
Thank you guys, I'll read the examples and give it a try.

On Fri, Jun 26, 2015 at 2:47 AM, jimfcarroll <ji...@gmail.com> wrote:

>
> I'm not sure if this is what you're looking for but we have several custom
> RDD implementations for internal data format/partitioning schemes.
>
> The Spark api is really simple and consists primarily of being able to
> implement 3 simple things:
>
> 1) You need a class that extends RDD that's lightweight because it needs to
> be Serializable to machines on a cluster (therefore it shouldn't actually
> contain the data, for example).
> 2) That class needs to implement getPartitions() to generate an array of
> serializable Partition instances.
> 3) That class needs to implement compute(Partition p, TaskContext t) which
> will (likely) be executed on a deserialized copy of your RDD class and
> provided a deserialized instance of one of the partitions returned from
> getPartitions() and needs to return an Iterator for the actual data within
> the partition.
>

Re: how to implement my own datasource?

Posted by jimfcarroll <ji...@gmail.com>.
I'm not sure if this is what you're looking for, but we have several custom
RDD implementations for internal data formats/partitioning schemes.

The Spark API is really simple and consists primarily of implementing three
things:

1) You need a class that extends RDD. It must be lightweight because it is
serialized to machines on the cluster (so it shouldn't actually contain the
data, for example).
2) That class needs to implement getPartitions() to generate an array of
serializable Partition instances.
3) That class needs to implement compute(Partition p, TaskContext t), which
will (likely) be executed on a deserialized copy of your RDD class, is given
a deserialized instance of one of the partitions returned from
getPartitions(), and must return an Iterator over the actual data within
the partition.
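The three steps above can be sketched as a minimal custom RDD. This is a
hypothetical RangeRDD that just serves an integer range; the class and
partition names are illustrative, not part of the Spark API, and a real
implementation would open a connection to the custom data source in
compute() instead of generating numbers:

```scala
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// (2) A serializable Partition: it carries only the metadata needed to
// locate this partition's data, never the data itself.
case class RangePartition(index: Int, start: Int, end: Int) extends Partition

// (1) A lightweight RDD subclass: it holds no data, only a description of
// how to produce it, so serializing it to the cluster is cheap.
class RangeRDD(sc: SparkContext, total: Int, slices: Int)
    extends RDD[Int](sc, Nil) {

  override def getPartitions: Array[Partition] = {
    val step = (total + slices - 1) / slices  // ceiling division
    Array.tabulate[Partition](slices) { i =>
      RangePartition(i, i * step, math.min((i + 1) * step, total))
    }
  }

  // (3) Runs on an executor against a deserialized partition and returns
  // an iterator over that partition's actual data.
  override def compute(split: Partition, context: TaskContext): Iterator[Int] = {
    val p = split.asInstanceOf[RangePartition]
    (p.start until p.end).iterator
  }
}
```

Used like any other RDD, e.g. `new RangeRDD(sc, 100, 4).sum()`.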





--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/how-to-implement-my-own-datasource-tp12881p12902.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org


Re: how to implement my own datasource?

Posted by Michael Armbrust <mi...@databricks.com>.
I'd suggest looking at the avro data source as an example implementation:

https://github.com/databricks/spark-avro

I also gave a talk a while ago: https://www.youtube.com/watch?v=GQSNJAzxOr8
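For reference, using that package looks roughly like this (a sketch per the
spark-avro README; the paths are placeholders, and this assumes the
com.databricks:spark-avro artifact is on the classpath):

```scala
// Read an Avro file into a DataFrame by naming the data source
// implementation in format(); spark-avro registers itself under
// this fully qualified package name.
val df = sqlContext.read
  .format("com.databricks.spark.avro")
  .load("path/to/episodes.avro")

// Writing back out uses the same format string.
df.write
  .format("com.databricks.spark.avro")
  .save("path/to/output")
```

The format()/load() pattern is exactly the hook a custom data source plugs
into, which is why it makes a good example to study.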

Re: how to implement my own datasource?

Posted by Juan Rodríguez Hortalá <ju...@gmail.com>.
Hi,

You can connect to databases via JDBC as described in
https://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases.
Another option is to use HadoopRDD and NewHadoopRDD to connect to
Hadoop-compatible databases such as HBase; some examples can be found in
chapter 5 of "Learning Spark":
https://books.google.es/books?id=tOptBgAAQBAJ&pg=PT190&dq=learning+spark+hadooprdd&hl=en&sa=X&ei=4bqLVaDaLsXaU46NgfgL&ved=0CCoQ6AEwAA#v=onepage&q=%20hadooprdd&f=false
For Spark Streaming, see the section "Custom Sources" of
https://spark.apache.org/docs/latest/streaming-programming-guide.html

Hope that helps.

Greetings,

Juan

2015-06-25 8:25 GMT+02:00 诺铁 <no...@gmail.com>:

> hi,
>
> I can't find documentation about datasource api,  how to implement custom
> datasource.  any hint is appreciated.    thanks.
>