Posted to user@spark.apache.org by Amit Kumar <ku...@gmail.com> on 2014/06/03 23:56:29 UTC

RDD with a Map

Hi Folks,

I am new to Spark, and this is probably a basic question.

I have a file on HDFS:

1, one
1, uno
2, two
2, dos

I want to create a multi-map RDD, RDD[Map[String,List[String]]]:

{"1"->["one","uno"], "2"->["two","dos"]}


First I read the file:
val identityData: RDD[String] = sc.textFile($path_to_the_file, 2).cache()

val identityDataList: RDD[List[String]] =
  identityData.map { line =>
    val splits = line.split(",")
    splits.toList
  }

Then I group them by the first element:

val grouped: RDD[(String, Iterable[List[String]])] =
  identityDataList.groupBy { element =>
    element(0)
  }

Then I do the equivalent of mapValues of Scala collections to get rid of
the first element:

val groupedWithValues: RDD[(String, List[String])] =
  grouped.flatMap { case (key, list) =>
    List((key, list.map(element => element(1)).toList))
  }

For this to actually materialize, I call collect:

val groupedAndCollected = groupedWithValues.collect()

I get an Array[(String, List[String])].

I am trying to figure out if there is a way for me to get a
Map[String,List[String]] (a multimap), or to create an
RDD[Map[String,List[String]]].


I am sure there is something simpler; I would appreciate advice.

Many thanks,
Amit

Re: RDD with a Map

Posted by Ian O'Connell <ia...@ianoconnell.com>.
So if your data can be kept in memory on the driver node, then you don't
really need Spark. If you want to use it for Hadoop reading, I'd call
collect immediately after you open the file; then you can do normal Scala
collections operations.
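
In code, that might look like this (a sketch only; $path_to_the_file is the
placeholder from the original message, and the comma-plus-space format of
the sample data is assumed):

val lines = sc.textFile($path_to_the_file).collect()  // pull the raw lines to the driver

// From here on it is plain Scala collections; this builds
// Map("1" -> List("one", "uno"), "2" -> List("two", "dos")).
val multiMap: Map[String, List[String]] =
  lines.toList
    .map { line => val Array(k, v) = line.split(","); (k, v.trim) }
    .groupBy(_._1)
    .map { case (k, pairs) => k -> pairs.map(_._2) }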



Re: RDD with a Map

Posted by Amit <ku...@gmail.com>.
Yes, an RDD as a map of String keys with Lists of Strings as values.

Amit

On Jun 4, 2014, at 2:46, Oleg Proudnikov <ol...@gmail.com> wrote:

> Just a thought... Are you trying to use the RDD as a Map?

Re: RDD with a Map

Posted by Oleg Proudnikov <ol...@gmail.com>.
Just a thought... Are you trying to use the RDD as a Map?


-- 
Kind regards,

Oleg

Re: RDD with a Map

Posted by Doris Xin <do...@gmail.com>.
Hey Amit,

You might want to check out PairRDDFunctions
<http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions>.
For your use case in particular, you can load the file as an
RDD[(String, String)] and then use the groupByKey() function in
PairRDDFunctions to get an RDD[(String, Iterable[String])].
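
Concretely, that might look something like this (a sketch; the path
placeholder and the comma-plus-space line format come from the original
message):

val pairs: RDD[(String, String)] = sc.textFile($path_to_the_file).map { line =>
  val Array(k, v) = line.split(",")
  k -> v.trim  // drop the space after the comma in the sample data
}
val grouped: RDD[(String, Iterable[String])] = pairs.groupByKey()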

Doris



Re: RDD with a Map

Posted by Amit <ku...@gmail.com>.
Thanks, folks. I was trying to get the RDD as a multimap, so collectAsMap is what I needed.

Best,
Amit


Re: RDD with a Map

Posted by Cheng Lian <li...@gmail.com>.
On Wed, Jun 4, 2014 at 5:56 AM, Amit Kumar <ku...@gmail.com> wrote:

> Hi Folks,
>
> I am new to Spark, and this is probably a basic question.
>
> I have a file on HDFS:
>
> 1, one
> 1, uno
> 2, two
> 2, dos
>
> I want to create a multi-map RDD, RDD[Map[String,List[String]]]:
>
> {"1"->["one","uno"], "2"->["two","dos"]}
>
Actually what you described is not a “multi-map RDD”; the type of this RDD
should be something like RDD[(String, List[String])]. RDD[Map[String,
List[String]]] indicates that each element within this RDD is itself a
Map[String, List[String]], and I don’t think this is what you want
according to the context.


>
> First I read the file:
> val identityData: RDD[String] = sc.textFile($path_to_the_file, 2).cache()
>
You don’t need to call .cache() here, since identityData is used only once;
the cached data won’t be used anywhere.
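
cache() only pays off when an RDD is read more than once, e.g. (an
illustrative sketch, not from the original reply):

val cached = identityData.cache()
cached.count()  // the first action computes the partitions and caches them
cached.first()  // later actions reuse the cached partitions instead of rereading HDFS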


> val identityDataList: RDD[List[String]] =
>   identityData.map { line =>
>     val splits = line.split(",")
>     splits.toList
>   }
>
Turning each text line into a (String, String) pair would be much more
useful, since you can then call functions like groupByKey, which are
defined in PairRDDFunctions:

val identityDataPairs: RDD[(String, String)] = identityData.map { line =>
  // Destructure the two comma-separated fields; trim the value because
  // the sample data has a space after each comma.
  val Array(key, value) = line.split(",")
  key -> value.trim
}


> Then I group them by the first element:
>
>  val grouped: RDD[(String, Iterable[List[String]])] =
>    identityDataList.groupBy { element =>
>      element(0)
>    }
>
Using groupByKey on pair RDDs is more convenient as mentioned above:

val grouped: RDD[(String, Iterable[String])] = identityDataPairs.groupByKey()


> Then I do the equivalent of mapValues of Scala collections to get rid of
> the first element:
>
>  val groupedWithValues: RDD[(String, List[String])] =
>    grouped.flatMap { case (key, list) =>
>      List((key, list.map(element => element(1)).toList))
>    }
>
> For this to actually materialize, I call collect:
>
>  val groupedAndCollected = groupedWithValues.collect()
>
> I get an Array[(String, List[String])].
>
> I am trying to figure out if there is a way for me to get a
> Map[String,List[String]] (a multimap), or to create an
> RDD[Map[String,List[String]]].
>
To get a Map[String, Iterable[String]], you may simply call collectAsMap,
which is only defined on pair RDDs:

val groupedAndCollected = grouped.collectAsMap()
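
If you want exactly the Map[String, List[String]] asked for, converting the
grouped values first should work (a sketch, not from the original reply):

val multiMap: scala.collection.Map[String, List[String]] =
  grouped.mapValues(_.toList).collectAsMap()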


>
> I am sure there is something simpler; I would appreciate advice.
>
> Many thanks,
> Amit
>
Finally, be careful if you are processing a large volume of data, since
groupByKey is an expensive transformation, and collecting all the data to
the driver side may cause an OOM if the data can’t fit on the driver node.
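
If, for instance, only per-key counts were needed rather than the full
value lists, a reduceByKey-based version would move much less data (an
illustrative sketch, not from the original reply):

// reduceByKey combines partial counts map-side before the shuffle,
// unlike groupByKey, which ships every value across the network.
val counts: RDD[(String, Int)] =
  identityDataPairs.mapValues(_ => 1).reduceByKey(_ + _)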


Best
Cheng