Posted to commits@beam.apache.org by "Tim Robertson (JIRA)" <ji...@apache.org> on 2017/11/15 10:06:00 UTC

[jira] [Comment Edited] (BEAM-3192) Be able to specify the Spark Partitioner via the pipeline options

    [ https://issues.apache.org/jira/browse/BEAM-3192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16253214#comment-16253214 ] 

Tim Robertson edited comment on BEAM-3192 at 11/15/17 10:05 AM:
----------------------------------------------------------------

A use case for this is iterative algorithms that require merging RDDs. In some cases you can gain significant performance by co-locating the RDDs that will be merged.

One implementation is the maps on [GBIF.org|https://www.gbif.org] (e.g. [Animals|https://www.gbif.org/species/1], [Birds|https://www.gbif.org/species/212], [Sparrows|https://www.gbif.org/species/2492321]), which are recalculated every few hours by Spark jobs coordinated by Oozie and persisted in HBase. This relies on Spark partitioning to [merge zoom levels up to world views|https://github.com/gbif/maps/blob/master/spark-process/src/main/scala/org/gbif/maps/spark/BackfillTiles.scala#L142] efficiently.
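To illustrate the tile-merging case, here is a minimal standalone sketch of a custom partitioner. The class name, the coarse-zoom parameter, and the {{int[] {zoom, x, y}}} key shape are all hypothetical; real Spark code would extend {{org.apache.spark.Partitioner}} and pass the instance to {{partitionBy}}.

```java
// Standalone mirror of Spark's Partitioner contract (hypothetical sketch;
// real code extends org.apache.spark.Partitioner).
abstract class Partitioner implements java.io.Serializable {
    public abstract int numPartitions();
    public abstract int getPartition(Object key);
}

/**
 * Routes a (zoom, x, y) tile to the partition of its ancestor tile at a
 * coarse zoom level, so tiles that merge into the same parent are co-located.
 */
class TileAncestorPartitioner extends Partitioner {
    private final int partitions;
    private final int coarseZoom;

    TileAncestorPartitioner(int partitions, int coarseZoom) {
        this.partitions = partitions;
        this.coarseZoom = coarseZoom;
    }

    @Override public int numPartitions() { return partitions; }

    @Override public int getPartition(Object key) {
        int[] k = (int[]) key;                       // {zoom, x, y}
        int shift = Math.max(0, k[0] - coarseZoom);  // levels above the coarse zoom
        // Descendants of the same coarse-zoom tile collapse to the same ancestor
        long ancestor = ((long) (k[1] >> shift) << 32) | (long) (k[2] >> shift);
        // Non-negative modulo, as Spark's HashPartitioner computes it
        return (int) (((ancestor % partitions) + partitions) % partitions);
    }
}
```

The point is only that the partition is a pure function of the key's coarse-zoom ancestor, which is what lets a join or merge of two RDDs partitioned this way avoid a shuffle.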

Another use case might be building HFiles offline in Spark for [efficient loading into HBase|http://www.opencore.com/blog/2016/10/efficient-bulk-load-of-hbase-using-spark/], which requires a {{repartitionAndSortWithinPartitions}} operation.
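For the HFile case, the following hypothetical standalone sketch shows the layout that operation guarantees: each record is routed to a partition by the partitioner, and keys are sorted within each partition (what HBase bulk loading expects). Real Spark code is simply {{rdd.repartitionAndSortWithinPartitions(partitioner)}}; the helper below only simulates the result in memory.

```java
import java.util.*;
import java.util.function.ToIntFunction;

class PartitionAndSort {
    /**
     * Routes each (key, value) record to a partition chosen by the given
     * function, then sorts records by key within each partition.
     */
    static <K extends Comparable<K>, V> Map<Integer, List<Map.Entry<K, V>>> run(
            List<Map.Entry<K, V>> records, ToIntFunction<K> partitioner) {
        Map<Integer, List<Map.Entry<K, V>>> out = new TreeMap<>();
        for (Map.Entry<K, V> r : records) {               // route to its partition
            out.computeIfAbsent(partitioner.applyAsInt(r.getKey()),
                                p -> new ArrayList<>()).add(r);
        }
        for (List<Map.Entry<K, V>> part : out.values()) { // sort within partition
            part.sort((x, y) -> x.getKey().compareTo(y.getKey()));
        }
        return out;
    }
}
```

With the partitioner aligned to HBase region boundaries, each sorted partition can be written directly as one HFile.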


was (Author: timrobertson100):
A use case for this is when one has iterative algorithms requiring merging of RDDs.  There are cases when you can make significant performance improvements by being able to colocate the RDDs that will be merged.

One implementation is the maps on GBIF.org (e.g. [Animals|https://www.gbif.org/species/1], [Birds|https://www.gbif.org/species/212], [Sparrows|https://www.gbif.org/species/2492321]) which are recalculated every few hours in Spark jobs coordinated by Oozie, and persisted in HBase.  This relies on using Spark partitioning to [merge zoom levels up to world views|https://github.com/gbif/maps/blob/master/spark-process/src/main/scala/org/gbif/maps/spark/BackfillTiles.scala#L142] efficiently.  

Another use case might be building HFiles offline in Spark for [efficient loading into HBase|http://www.opencore.com/blog/2016/10/efficient-bulk-load-of-hbase-using-spark/] which requires a {{repartitionAndSortWithinPartition}} operation.

> Be able to specify the Spark Partitioner via the pipeline options
> -----------------------------------------------------------------
>
>                 Key: BEAM-3192
>                 URL: https://issues.apache.org/jira/browse/BEAM-3192
>             Project: Beam
>          Issue Type: New Feature
>          Components: runner-spark
>            Reporter: Jean-Baptiste Onofré
>            Assignee: Jean-Baptiste Onofré
>
> As we did for the StorageLevel, it would be great for a user to be able to provide the Spark partitioner via PipelineOptions.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)