Posted to user@spark.apache.org by nitin <ni...@gmail.com> on 2014/12/04 11:00:39 UTC

SchemaRDD partition on specific column values?

Hi All,

I want to hash partition (and then cache) a SchemaRDD in such a way that
the partitions are based on the hash of the values of a column ("ID" column
in my case).

e.g. if my table has an "ID" column with values 1,2,3,4,5,6,7,8,9 and
spark.sql.shuffle.partitions is configured as 3, then there should be 3
partitions, and all the tuples for a given ID (say ID=1) should be present
in one particular partition.
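
Something like this is what I have in mind (just a rough sketch against the
Spark 1.2-era RDD API; the table/column names and the SparkContext "sc" are
made up):

    import org.apache.spark.HashPartitioner
    import org.apache.spark.SparkContext._
    import org.apache.spark.sql.hive.HiveContext

    // Rough sketch (hypothetical table/column names): key each Row by
    // the ID column (assumed to be column 0), hash-partition into 3
    // partitions, and cache the result.
    val sqlContext = new HiveContext(sc)
    val schemaRdd = sqlContext.sql("SELECT id, value FROM my_table")
    val byId = schemaRdd
      .keyBy(row => row.getInt(0))
      .partitionBy(new HashPartitioner(3))
      .cache()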

My actual use case is that every query I get joins 2 cached tables on the
ID column, so Spark first partitions both tables on ID and then applies the
JOIN. I want to avoid that per-query partitioning on ID by doing it once as
a preprocessing step (and then caching the result).
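
The preprocessing I am thinking of would look roughly like the sketch below
(hypothetical table names again; sqlContext as above). Both sides share one
partitioner, so the join itself should not need a shuffle:

    import org.apache.spark.HashPartitioner
    import org.apache.spark.SparkContext._

    // Pre-partition both cached tables on ID with the same partitioner,
    // once, up front (table names are made up).
    val partitioner = new HashPartitioner(3)
    val left = sqlContext.sql("SELECT * FROM table_a")
      .keyBy(_.getInt(0)).partitionBy(partitioner).cache()
    val right = sqlContext.sql("SELECT * FROM table_b")
      .keyBy(_.getInt(0)).partitionBy(partitioner).cache()

    // join() sees matching partitioners on both sides, so it can run as
    // a narrow, co-partitioned join instead of a shuffle join.
    val joined = left.join(right)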

Thanks in Advance





Re: SchemaRDD partition on specific column values?

Posted by Nitin Goyal <ni...@gmail.com>.
Hi Michael,

I have opened the following JIRA for this:

https://issues.apache.org/jira/browse/SPARK-4849

I am having a look at the code to see what can be done, and then we can
discuss the approach.

Let me know if you have any comments/suggestions.

Thanks
-Nitin

On Sun, Dec 14, 2014 at 2:53 PM, Michael Armbrust <mi...@databricks.com>
wrote:
>
> I'm happy to discuss what it would take to make sure we can propagate this
> information correctly.  Please open a JIRA (and mention me in it).
>
> Regarding including it in 1.2.1, it depends on how invasive the change
> ends up being, but it is certainly possible.
>
> On Thu, Dec 11, 2014 at 3:55 AM, nitin <ni...@gmail.com> wrote:
>>
>> Can we take this as a performance improvement task in Spark 1.2.1? I can
>> help contribute to this.

-- 
Regards
Nitin Goyal

Re: SchemaRDD partition on specific column values?

Posted by Michael Armbrust <mi...@databricks.com>.
I'm happy to discuss what it would take to make sure we can propagate this
information correctly.  Please open a JIRA (and mention me in it).

Regarding including it in 1.2.1, it depends on how invasive the change ends
up being, but it is certainly possible.

On Thu, Dec 11, 2014 at 3:55 AM, nitin <ni...@gmail.com> wrote:
>
> Can we take this as a performance improvement task in Spark 1.2.1? I can
> help contribute to this.

Re: SchemaRDD partition on specific column values?

Posted by nitin <ni...@gmail.com>.
Can we take this as a performance improvement task in Spark 1.2.1? I can
help contribute to this.






Re: SchemaRDD partition on specific column values?

Posted by Michael Armbrust <mi...@databricks.com>.
It does not appear that the in-memory caching currently preserves
information about the partitioning of the data, so this optimization will
probably not work.
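
One rough way to see this (an illustration only; "my_table" and the
sqlContext are assumed to exist) is to inspect the physical plan of a join
on a cached table and look for the Exchange operators:

    // Even if the underlying data was pre-partitioned on id, the cached
    // in-memory relation does not expose that partitioning, so the
    // planner still inserts an Exchange before the join.
    sqlContext.cacheTable("my_table")
    val plan = sqlContext.sql(
      "SELECT * FROM my_table a JOIN my_table b ON a.id = b.id")
      .queryExecution.executedPlan
    println(plan)  // Exchange operators still appear above the join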

On Thu, Dec 4, 2014 at 8:42 PM, nitin <ni...@gmail.com> wrote:

> With some quick googling, I learnt that we can provide "DISTRIBUTE BY
> <column_name>" in HiveQL to distribute data based on a column's values.
> My question now is: if I use "DISTRIBUTE BY id", will there be any
> performance improvements? Will I be able to avoid the data movement in
> the shuffle (Exchange before the JOIN step) and improve overall
> performance?

Re: SchemaRDD partition on specific column values?

Posted by nitin <ni...@gmail.com>.
With some quick googling, I learnt that we can provide "DISTRIBUTE BY
<column_name>" in HiveQL to distribute data based on a column's values. My
question now is: if I use "DISTRIBUTE BY id", will there be any performance
improvements? Will I be able to avoid the data movement in the shuffle
(Exchange before the JOIN step) and improve overall performance?
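
For concreteness, this is the kind of statement I mean (a sketch only; the
table name is hypothetical and "sc" is an existing SparkContext):

    import org.apache.spark.sql.hive.HiveContext

    // DISTRIBUTE BY hashes rows across partitions by the given column,
    // similar to a hash partition on id (table name is made up).
    val hiveContext = new HiveContext(sc)
    val distributed = hiveContext.sql(
      "SELECT id, value FROM my_table DISTRIBUTE BY id")
    distributed.registerTempTable("my_table_by_id")
    hiveContext.cacheTable("my_table_by_id")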


