Posted to dev@spark.apache.org by Rajendran Appavu <ap...@in.ibm.com> on 2014/08/22 11:21:37 UTC

Adding support for a new object store

                                                                                                                           
I am new to the Spark source code and am looking to see whether I can add
push-down support for Spark filters to the storage layer (in my case an
object store). I would also like to consider how this can be done
generically for any store that we might want to integrate with Spark. I
want to know which areas I should look into to provide support for a new
data store in this context. Here are some of the questions I have to start
with:

1. Do we need to create a new RDD class for the new store that we want to
support? From the SparkContext, we create an RDD, and the operations on
data, including the filter, are performed through the RDD methods.

2. When we specify the code for a filter task in the RDD.filter() method,
how does it get communicated to the Executor on the data node? Does the
Executor need to compile this code on the fly and execute it, or how does
it work? (I have looked at the code for some time but have not yet figured
this out, so I am looking for some pointers that can help me come up to
speed on this part of the code.)

3. How long does the Executor hold the memory, and how does it decide when
to release the memory/cache?

Thank you in advance.
                                                                                                                           
                                                                                                                           



Regards,
Rajendran.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org


Re: Adding support for a new object store

Posted by Reynold Xin <rx...@databricks.com>.
Linking to the JIRA tracking APIs to hook into the planner:
https://issues.apache.org/jira/browse/SPARK-3248


Re: Adding support for a new object store

Posted by Reynold Xin <rx...@databricks.com>.
Hi Rajendran,

I'm assuming you have some concept of a schema and intend to integrate with
SchemaRDD instead of normal RDDs.

More responses inline below.


On Fri, Aug 22, 2014 at 2:21 AM, Rajendran Appavu <ap...@in.ibm.com>
wrote:

>
> I am new to the Spark source code and am looking to see whether I can add
> push-down support for Spark filters to the storage layer (in my case an
> object store). I would also like to consider how this can be done
> generically for any store that we might want to integrate with Spark. I
> want to know which areas I should look into to provide support for a new
> data store in this context. Here are some of the questions I have to
> start with:
>
> 1. Do we need to create a new RDD class for the new store that we want to
> support? From the SparkContext, we create an RDD, and the operations on
> data, including the filter, are performed through the RDD methods.
>

You can create a new RDD type for a new storage system, and you can create
a new table scan operator in SQL to read from it.
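
As a rough illustration (not an existing Spark API), a minimal custom RDD for
a hypothetical object store could look like the sketch below; ObjectStoreClient
and its methods are made-up placeholders, and a real implementation would also
handle splitting large objects, retries, and preferred locations.

    import org.apache.spark.{Partition, SparkContext, TaskContext}
    import org.apache.spark.rdd.RDD

    // Hypothetical client for the object store -- a placeholder, not a real API.
    class ObjectStoreClient(endpoint: String) extends Serializable {
      def listKeys(bucket: String): Seq[String] = Seq.empty
      def readLines(bucket: String, key: String): Iterator[String] = Iterator.empty
    }

    case class ObjectStorePartition(index: Int, key: String) extends Partition

    // One RDD partition per object; compute() streams that object's contents.
    class ObjectStoreRDD(sc: SparkContext, endpoint: String, bucket: String)
      extends RDD[String](sc, Nil) {

      override def getPartitions: Array[Partition] =
        new ObjectStoreClient(endpoint).listKeys(bucket)
          .zipWithIndex
          .map { case (key, i) => ObjectStorePartition(i, key) }
          .toArray[Partition]

      override def compute(split: Partition, context: TaskContext): Iterator[String] = {
        val part = split.asInstanceOf[ObjectStorePartition]
        new ObjectStoreClient(endpoint).readLines(bucket, part.key)
      }
    }

Note that a plain new ObjectStoreRDD(sc, endpoint, bucket).filter(...) still
evaluates the filter inside Spark; pushing the predicate into the store itself
is what the SQL-side hooks below are about.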


> 2. When we specify the code for a filter task in the RDD.filter() method,
> how does it get communicated to the Executor on the data node? Does the
> Executor need to compile this code on the fly and execute it, or how does
> it work? (I have looked at the code for some time but have not yet figured
> this out, so I am looking for some pointers that can help me come up to
> speed on this part of the code.)
>

Right now the best way to do this is to hack the SQL strategies, which do
some predicate pushdown into the table scan:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala

We are in the process of proposing an API that allows external data stores
to hook into the planner. Expect a design proposal in early/mid Sept.

Once that is in place, you won't need to hack the planner anymore. It is a
good idea to start prototyping by hacking the planner and then migrate to
the planner hook API once it is ready.
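
For a concrete flavor of what such a hook can look like, here is a rough
sketch written against the org.apache.spark.sql.sources interfaces
(BaseRelation, PrunedFilteredScan, Filter) from Spark SQL's external data
sources API; the exact API proposed here may differ, and the schema and the
store-side query translation below are purely hypothetical.

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.sources.{BaseRelation, EqualTo, Filter, GreaterThan, PrunedFilteredScan}
    import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

    // Hypothetical relation backed by an object store.
    class ObjectStoreRelation(bucket: String, sqlCtx: SQLContext)
      extends BaseRelation with PrunedFilteredScan {

      override def sqlContext: SQLContext = sqlCtx

      override def schema: StructType = StructType(Seq(
        StructField("key", StringType),
        StructField("size", LongType)))

      // Spark hands the scan the columns it needs and the predicates it was able
      // to push down; filters the store cannot apply are re-checked by Spark.
      override def buildScan(requiredColumns: Array[String],
                             filters: Array[Filter]): RDD[Row] = {
        val storeQuery = filters.collect {
          case EqualTo(attr, value)     => s"$attr = $value"
          case GreaterThan(attr, value) => s"$attr > $value"
        }.mkString(" AND ")
        // ... issue storeQuery against the object store and build Rows here ...
        sqlCtx.sparkContext.parallelize(Seq.empty[Row])
      }
    }

The point is simply that the planner passes the filters down to the data
source, so the store can avoid reading objects that cannot match.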


>
> 3. How long does the Executor hold the memory, and how does it decide
> when to release the memory/cache?
>

Executors by default don't actually hold any data in memory. Spark requires
explicit caching of data, i.e. only when rdd.cache() is called will Spark
executors keep the contents of that RDD in memory. The executor has a
component called the BlockManager that evicts cached blocks on an LRU basis.
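
A small spark-shell style sketch of what that means in practice (sc is the
shell's SparkContext; the object-store path is a hypothetical URI):

    import org.apache.spark.storage.StorageLevel

    // Nothing is kept in executor memory until we ask for it explicitly.
    val logs   = sc.textFile("objectstore://my-bucket/logs/*")  // hypothetical URI
    val errors = logs.filter(_.contains("ERROR"))

    errors.persist(StorageLevel.MEMORY_ONLY)   // equivalent to errors.cache()
    errors.count()      // first action computes the RDD and caches its blocks
    errors.count()      // now served from the executors' BlockManagers
    errors.unpersist()  // drop the cached blocks rather than wait for LRU eviction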



>
>  Thank you in advance.
>
>
>
>
>
> Regards,
> Rajendran.
>
>