Posted to issues@spark.apache.org by "Nicholas Chammas (JIRA)" <ji...@apache.org> on 2016/03/28 19:58:25 UTC

[jira] [Updated] (SPARK-3533) Add saveAsTextFileByKey() method to RDDs

     [ https://issues.apache.org/jira/browse/SPARK-3533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nicholas Chammas updated SPARK-3533:
------------------------------------
    Description: 
Users often have a single RDD of key-value pairs that they want to save to multiple locations based on the keys.

For example, say I have an RDD like this:
{code}
>>> a = sc.parallelize(['Nick', 'Nancy', 'Bob', 'Ben', 'Frankie']).keyBy(lambda x: x[0])
>>> a.collect()
[('N', 'Nick'), ('N', 'Nancy'), ('B', 'Bob'), ('B', 'Ben'), ('F', 'Frankie')]
>>> a.keys().distinct().collect()
['B', 'F', 'N']
{code}

Now I want to write the RDD out to different paths depending on the keys, so that I have one output directory per distinct key. Each output directory could potentially have multiple {{part-}} files, one per RDD partition.

So the output would look something like:

{code}
/path/prefix/B [/part-1, /part-2, etc]
/path/prefix/F [/part-1, /part-2, etc]
/path/prefix/N [/part-1, /part-2, etc]
{code}

Though it may be possible to do this with some combination of {{saveAsNewAPIHadoopFile()}}, {{saveAsHadoopFile()}}, and the {{MultipleTextOutputFormat}} output format class, it isn't straightforward. It's not clear if it's even possible at all in PySpark.
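
For reference, here's roughly what that approach looks like in Scala with the old Hadoop API. This is only a sketch: {{RDDMultipleTextOutputFormat}} is a made-up helper class (not part of Spark or Hadoop), and {{pairs}} stands for an RDD of {{(String, String)}} pairs like the one above.

{code}
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

// Route each record to a subdirectory named after its key.
class RDDMultipleTextOutputFormat extends MultipleTextOutputFormat[Any, Any] {
  // Don't prefix each output line with the key; write only the value.
  override def generateActualKey(key: Any, value: Any): Any =
    NullWritable.get()

  // Use the key as the subdirectory, keeping the default part-NNNNN file name.
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    key.asInstanceOf[String] + "/" + name
}

pairs.saveAsHadoopFile("/path/prefix", classOf[String], classOf[String],
  classOf[RDDMultipleTextOutputFormat])
{code}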

Please add a {{saveAsTextFileByKey()}} method or something similar to RDDs that makes it easy to save RDDs out to multiple locations at once.
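
To be concrete, the hoped-for usage is a one-liner along these lines ({{saveAsTextFileByKey()}} is hypothetical and does not exist today; {{pairs}} is the same kind of keyed RDD as above):

{code}
// Hypothetical API: one output directory per distinct key under /path/prefix,
// giving the layout shown above.
pairs.saveAsTextFileByKey("/path/prefix")
{code}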

---

Update: March 2016

There are two workarounds to this problem:

1. See [this answer on Stack Overflow|http://stackoverflow.com/a/26051042/877069], which implements {{MultipleTextOutputFormat}}. (Scala-only)
2. See [this comment by Davies Liu|https://github.com/apache/spark/pull/8375#issuecomment-202458325], which uses DataFrames:
    {code}
// gen_key() stands for whatever function extracts the key from a record;
// needs the SQL implicits in scope (e.g. import sqlContext.implicits._).
val df = rdd.map(t => (gen_key(t), t)).toDF("key", "text")
df.write.partitionBy("key").text(path)
{code}
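
One caveat worth noting about the {{partitionBy}} workaround: with Spark's default behavior it produces Hive-style {{key=value}} directory names rather than the bare key directories sketched above, e.g.:

{code}
/path/prefix/key=B [/part-..., etc]
/path/prefix/key=F [/part-..., etc]
/path/prefix/key=N [/part-..., etc]
{code}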


> Add saveAsTextFileByKey() method to RDDs
> ----------------------------------------
>
>                 Key: SPARK-3533
>                 URL: https://issues.apache.org/jira/browse/SPARK-3533
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark, Spark Core
>    Affects Versions: 1.1.0
>            Reporter: Nicholas Chammas
>



