Posted to issues@spark.apache.org by "Cristian (JIRA)" <ji...@apache.org> on 2015/07/21 02:14:04 UTC

[jira] [Commented] (SPARK-4849) Pass partitioning information (distribute by) to In-memory caching

    [ https://issues.apache.org/jira/browse/SPARK-4849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14634316#comment-14634316 ] 

Cristian commented on SPARK-4849:
---------------------------------

I would argue that the priority for this is not Minor: resolving it would enable many use cases where data is held in memory and queried repeatedly at low latency.

For example, in Spark Streaming applications it's common to join incoming data with a memory-resident dataset for enrichment. If that join could be performed without a shuffle, it would enable important use cases that are currently too high-latency to implement with Spark.
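
To make that scenario concrete, here is a rough sketch of the enrichment pattern (the table name "users", its columns, the socket source, and the spark-shell-provided `sc` are all illustrative, not from this ticket), assuming the Spark 1.3+ DataFrame API and a HiveContext for the DISTRIBUTE BY syntax. As things stand, the planner does not see the cached table's partitioning, so the per-batch join below still incurs a shuffle of the cached side:

{code:scala}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(5))
val hc  = new HiveContext(sc)
import hc.implicits._

// Reference data, partitioned on the join key via DISTRIBUTE BY and cached.
// (Hypothetical Hive table "users" with columns user_id and segment.)
val users = hc.sql("SELECT user_id, segment FROM users DISTRIBUTE BY user_id")
users.registerTempTable("users_partitioned")
hc.cacheTable("users_partitioned")

// Incoming events, e.g. "user_id,event" lines from a socket source.
val events = ssc.socketTextStream("localhost", 9999)
  .map(_.split(","))
  .map(a => (a(0), a(1)))

events.foreachRDD { rdd =>
  val batch = rdd.toDF("user_id", "event")
  batch.registerTempTable("batch")
  // If the cached table's partitioning were propagated, only the small batch
  // side would need to move; currently both sides may be shuffled.
  hc.sql(
    """SELECT b.user_id, b.event, u.segment
      |FROM batch b JOIN users_partitioned u ON b.user_id = u.user_id""".stripMargin)
    .show()
}

ssc.start()
ssc.awaitTermination()
{code}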

It also appears to be a fairly straightforward fix, so is there any chance it can get some priority?

> Pass partitioning information (distribute by) to In-memory caching
> ------------------------------------------------------------------
>
>                 Key: SPARK-4849
>                 URL: https://issues.apache.org/jira/browse/SPARK-4849
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.2.0
>            Reporter: Nitin Goyal
>            Priority: Minor
>
> HQL "distribute by <column_name>" partitions data based on specified column values. We can pass this information to in-memory caching for further performance improvements. e..g. in Joins, an extra partition step can be saved based on this information.
> Refer - http://apache-spark-user-list.1001560.n3.nabble.com/SchemaRDD-partition-on-specific-column-values-td20350.html
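
For reference, the RDD API already behaves the way the description asks for; the following is a hypothetical RDD-level analogue (assuming a spark-shell session with `sc` predefined, data and partition count invented for illustration), not the SQL-layer change itself. The request is that the in-memory columnar cache retain the DISTRIBUTE BY partitioning in the same way, so the planner can skip the extra exchange:

{code:scala}
import org.apache.spark.HashPartitioner

// Pre-partition the reference RDD on the join key and cache it.
// The Partitioner is recorded on the cached RDD.
val reference = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c")))
  .partitionBy(new HashPartitioner(8))
  .cache()

val incoming = sc.parallelize(Seq((1, 10.0), (2, 20.0)))

// Because `reference` already carries HashPartitioner(8), this join shuffles
// only `incoming`; the cached side is read in place.
val joined = reference.join(incoming)
joined.collect().foreach(println)
{code}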



