Posted to issues@spark.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2018/12/10 22:52:00 UTC

[jira] [Commented] (SPARK-12196) Store/retrieve blocks in different speed storage devices by hierarchy way

    [ https://issues.apache.org/jira/browse/SPARK-12196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16715711#comment-16715711 ] 

ASF GitHub Bot commented on SPARK-12196:
----------------------------------------

vanzin commented on issue #10225: [SPARK-12196][Core] Store/retrieve blocks from different speed storage devices by hierarchical way
URL: https://github.com/apache/spark/pull/10225#issuecomment-446004810
 
 
   @yucai seems like this was forgotten. I'm going to close it for now; if you want to bring it up to date and update your branch, it will be re-opened automatically.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> Store/retrieve blocks in different speed storage devices by hierarchy way
> -------------------------------------------------------------------------
>
>                 Key: SPARK-12196
>                 URL: https://issues.apache.org/jira/browse/SPARK-12196
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>            Reporter: yucai
>            Priority: Major
>
> *Motivation*
> Our customers want to use SSDs to speed up machine learning and SQL workloads, but SSDs are quite expensive and an SSD's capacity is still smaller than an HDD's.
> *Proposal*
> Our solution is to build tiered storage: use SSDs as the cache and HDDs as the backup.
> When Spark core allocates blocks (either for shuffle or for the RDD cache), it stores them on the SSDs first; once the SSDs' free space falls below a threshold, it starts using the HDDs.
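> As an illustration of this allocation policy, here is a minimal sketch in Scala; the names below (TieredDiskAllocator, freeSpaceThreshold, selectDir) are hypothetical and not taken from the actual patch:
> {code}
> import java.io.File
> 
> // Hypothetical sketch: walk the tiers from fastest to slowest and pick the
> // first directory whose usable space is still above the threshold.
> class TieredDiskAllocator(tiers: Seq[Seq[File]], freeSpaceThreshold: Long) {
>   def selectDir(): File = {
>     val chosen = tiers.view
>       .flatMap(dirs => dirs.find(_.getUsableSpace > freeSpaceThreshold))
>       .headOption
>     // If every directory in every tier is below the threshold,
>     // fall back to the first directory of the slowest tier.
>     chosen.getOrElse(tiers.last.head)
>   }
> }
> 
> // Example usage (paths and the 10 GiB threshold are made up):
> // val ssds = Seq(new File("/mnt/ssd1"), new File("/mnt/ssd2"))
> // val hdds = Seq(new File("/mnt/hdd1"), new File("/mnt/hdd2"))
> // val allocator = new TieredDiskAllocator(Seq(ssds, hdds), 10L * 1024 * 1024 * 1024)
> // val dir = allocator.selectDir()
> {code}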
> *Performance Evaluation*
> 1. In the best case, our solution performs the same as an all-SSD setup.
> 2. In the worst case, e.g. when all data spills to HDDs, there is no performance regression.
> 3. Compared with an all-HDD setup, the tiered store improves performance by about 2x for the machine learning workload and about 1.7x for the Spark SQL workload.
> *Usage*
> 1. Enable tiered storage in spark-defaults.conf.
> {code}
> spark.diskStore.allocator      tiered
> {code}
> 2. Configure the storage hierarchy; for YARN users, see the example below:
> {code}
>   <property>
>     <name>yarn.nodemanager.local-dirs</name>
>     <value>/mnt/DP_disk1/yucai/yarn/local,/mnt/DP_disk2/yucai/yarn/local,
>            /mnt/DP_disk3/yucai/yarn/local,/mnt/DP_disk4/yucai/yarn/local,
>            /mnt/DP_disk5/yucai/yarn/local,/mnt/DP_disk6/yucai/yarn/local,
>     </value>
>   </property>
>   <property>
>     <name>yarn.nodemanager.spark-dirs-tiers</name>
>     <value>001111</value>
>   </property>
> {code}
> It means DP_disk1-2 are in tier 1 and DP_disk3-6 make up tier 2.
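> To make the mapping concrete, here is a small sketch of how such a digit mask could be parsed; parseTierMask is a made-up name for illustration, not the patch's actual code:
> {code}
> // Group local dirs into tiers according to a digit mask such as "001111":
> // the i-th digit is the tier index of the i-th directory.
> def parseTierMask(localDirs: Seq[String], mask: String): Map[Int, Seq[String]] = {
>   require(localDirs.length == mask.length,
>     "the mask must contain one digit per local directory")
>   localDirs.zip(mask)
>     .groupBy { case (_, digit) => digit.asDigit }
>     .map { case (tier, pairs) => tier -> pairs.map(_._1) }
> }
> 
> // For the mask "001111": the digit-0 dirs (DP_disk1-2) form the first tier
> // and the digit-1 dirs (DP_disk3-6) form the second tier.
> {code}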
>  
> *More tiers*
> In our implementation, we support building any number of tiers across various storage media (NVMe, SSD, HDD, etc.). For example:
> {code}
>   <property>
>     <name>yarn.nodemanager.local-dirs</name>
>     <value>/mnt/DP_disk1/yucai/yarn/local,/mnt/DP_disk2/yucai/yarn/local,
>            /mnt/DP_disk3/yucai/yarn/local,/mnt/DP_disk4/yucai/yarn/local,
>            /mnt/DP_disk5/yucai/yarn/local,/mnt/DP_disk6/yucai/yarn/local,
>     </value>
>   </property>
>   <property>
>     <name>yarn.nodemanager.spark-dirs-tiers</name>
>     <value>001122</value>
>   </property>
> {code}
> It means DP_disk1-2 are in tier 1, DP_disk3-4 are in tier 2, and DP_disk5-6 make up tier 3.
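> Tying the two sketches above together (again, purely illustrative), the parsed tiers could be handed to the allocator in ascending digit order, fastest media first:
> {code}
> // Hypothetical glue code, reusing the sketches above:
> // val byTier  = parseTierMask(localDirs, "001122")
> // val ordered = byTier.toSeq.sortBy(_._1).map(_._2.map(new File(_)))  // Seq[Seq[File]]
> // val allocator = new TieredDiskAllocator(ordered, freeSpaceThreshold)
> {code}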



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org