Posted to dev@spark.apache.org by Niranda Perera <ni...@gmail.com> on 2015/01/13 09:51:35 UTC

create a SchemaRDD from a custom datasource

Hi,

We have a custom datasources API that connects to various data sources and
exposes them through a common interface. We are now trying to implement the
Spark data sources API released in 1.2.0 so that we can connect our sources
to Spark for analytics.

Looking at the sources API, we figured out that we should extend one of the
scan classes (TableScan etc.), which means implementing the 'schema' and
'buildScan' methods.

Say we can infer the schema of the underlying data and read the data out as
Row elements. Is there any way to create an RDD[Row] (needed by the
buildScan method) from these Row elements?
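
To make the question concrete, here is roughly what we have in mind against
the 1.2.0 sources API (where TableScan is an abstract class extending
BaseRelation). MyRelation, fetchAllRows and the hard-coded two-column schema
are placeholders for our custom datasources API, not real names:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql._                  // Row, SQLContext, StructType, ... (1.2 locations)
    import org.apache.spark.sql.sources.TableScan  // abstract class extending BaseRelation in 1.2.0

    // MyRelation and fetchAllRows stand in for our custom datasources API.
    class MyRelation(@transient val sqlContext: SQLContext) extends TableScan {

      // Schema inferred from the underlying source's metadata (hard-coded here).
      override def schema: StructType = StructType(Seq(
        StructField("id", IntegerType, nullable = false),
        StructField("name", StringType, nullable = true)))

      // Eager, driver-side read from the custom API, returning Row elements.
      private def fetchAllRows(): Seq[Row] = Seq(Row(1, "alpha"), Row(2, "beta"))

      // This is the part we are unsure about: turning those Rows into an RDD[Row].
      override def buildScan(): RDD[Row] = ???
    }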

Cheers
-- 
Niranda

Re: create a SchemaRDD from a custom datasource

Posted by Reynold Xin <rx...@databricks.com>.
If it is a small collection of them on the driver, you can just use
sc.parallelize to create an RDD.
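
A minimal sketch of that, assuming the rows have already been pulled from the
custom API into a small Seq[Row] on the driver (rows is a placeholder, and sc
is the SparkContext; from inside buildScan it is reachable as
sqlContext.sparkContext):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.Row

    // Placeholder for rows fetched eagerly from the custom API on the driver.
    val rows: Seq[Row] = Seq(Row(1, "alpha"), Row(2, "beta"))

    // parallelize distributes the local collection as an RDD[Row]; fine as
    // long as the whole result comfortably fits in driver memory.
    val rowRdd: RDD[Row] = sc.parallelize(rows)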


On Tue, Jan 13, 2015 at 7:56 AM, Malith Dhanushka <mm...@gmail.com>
wrote:

> Hi Reynold,
>
> Thanks for the response. I am just wondering: let's say we have a set of Row
> objects. Isn't there a straightforward way of creating an RDD[Row] out of them
> without writing a custom RDD?
>
> ie - a utility method
>
> Thanks
> Malith
>
> On Tue, Jan 13, 2015 at 2:29 PM, Reynold Xin <rx...@databricks.com> wrote:
>
>> Depends on what the other side is doing. You can create your own RDD
>> implementation by subclassing RDD, or it might work if you use
>> sc.parallelize(1 to n, n).mapPartitionsWithIndex( /* code to read the data
>> and return an iterator */ ) where n is the number of partitions.
>>
>> On Tue, Jan 13, 2015 at 12:51 AM, Niranda Perera <
>> niranda.perera@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> We have a custom datasources API that connects to various data sources and
>>> exposes them through a common interface. We are now trying to implement the
>>> Spark data sources API released in 1.2.0 so that we can connect our sources
>>> to Spark for analytics.
>>>
>>> Looking at the sources API, we figured out that we should extend one of the
>>> scan classes (TableScan etc.), which means implementing the 'schema' and
>>> 'buildScan' methods.
>>>
>>> Say we can infer the schema of the underlying data and read the data out as
>>> Row elements. Is there any way to create an RDD[Row] (needed by the
>>> buildScan method) from these Row elements?
>>>
>>> Cheers
>>> --
>>> Niranda
>>>
>>
>>
>
>

Re: create a SchemaRDD from a custom datasource

Posted by Reynold Xin <rx...@databricks.com>.
Depends on what the other side is doing. You can create your own RDD
implementation by subclassing RDD, or it might work if you use
sc.parallelize(1 to n, n).mapPartitionsWithIndex( /* code to read the data
and return an iterator */ ) where n is the number of partitions.
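
As a rough sketch of that pattern (readPartition below is a placeholder for
whatever call in the custom datasources API reads one slice of the data, and
sc is the SparkContext):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.Row

    val n = 4  // number of partitions

    // Placeholder for a call into the custom datasources API that returns
    // the rows belonging to one slice (page, shard, ...) of the data.
    def readPartition(i: Int): Iterator[Row] =
      Iterator(Row(i, "row-from-slice-" + i))

    // One dummy element per partition; each task reads its own slice on the
    // executors, so the data itself is never collected on the driver.
    val rowRdd: RDD[Row] = sc.parallelize(1 to n, n).mapPartitionsWithIndex {
      (index, _) => readPartition(index)
    }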

On Tue, Jan 13, 2015 at 12:51 AM, Niranda Perera <ni...@gmail.com>
wrote:

> Hi,
>
> We have a custom datasources API that connects to various data sources and
> exposes them through a common interface. We are now trying to implement the
> Spark data sources API released in 1.2.0 so that we can connect our sources
> to Spark for analytics.
>
> Looking at the sources API, we figured out that we should extend one of the
> scan classes (TableScan etc.), which means implementing the 'schema' and
> 'buildScan' methods.
>
> Say we can infer the schema of the underlying data and read the data out as
> Row elements. Is there any way to create an RDD[Row] (needed by the
> buildScan method) from these Row elements?
>
> Cheers
> --
> Niranda
>