Posted to user@spark.apache.org by "Kali.tummala@gmail.com" <Ka...@gmail.com> on 2015/10/09 16:59:33 UTC

Create hashmap using two RDD's

Hi all, 

I am trying to create a hashmap from two RDDs, but I am getting a "key not
found" error.
Do I need to convert the RDDs to lists first?

1) one RDD holds the key data
2) the other RDD holds the value data

Key RDD:-
val quotekey = file.map(x => x.split("\\|"))
                   .filter(line => line(0).contains("1017"))
                   .map(x => x(5) + x(4))

Value RDD:-
val QuoteRDD = quotefile.map(x => x.split("\\|"))
  .filter(line => line(0).contains("1017"))
  .map(x => (x(5).toString + x(4).toString, x(5).toString, x(4).toString, x(1).toString,
    if (x(15).toString == "B")
      if (x(25).toString == "") x(9).toString else x(25).toString,
    if (x(37).toString == "") x(11).toString else x(37).toString,
    if (x(15).toString == "C")
      if (x(24).toString == "") x(9).toString else x(24).toString,
    if (x(30).toString == "") x(11).toString else x(30).toString,
    if (x(15).toString == "A")
      x(9).toString,
    x(11).toString
  ))

Hash Map:-
val quotehash = new HashMap[String,String]
quotehash + quotekey.toString() -> QuoteRDD
quotehash("CPHI080000172")

Error:-
key not found: CPHI080000172
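
(For reference: the snippet above never actually populates the map. The
result of the + expression is discarded, and quotekey.toString() stringifies
the whole RDD object rather than its keys, so quotehash stays empty. A
minimal sketch of what the replies below converge on, collecting a pair RDD
to the driver; x(1) here is just an illustrative stand-in for the real value
fields:)

// Hedged sketch: build (key, value) pairs, then bring them to the driver
val pairs = quotefile.map(x => x.split("\\|"))
  .filter(line => line(0).contains("1017"))
  .map(x => (x(5) + x(4), x(1)))   // x(1) stands in for the real value fields
val quotehash: Map[String, String] = pairs.collect().toMap
quotehash.get("CPHI080000172")     // returns an Option instead of throwing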

Thanks


Re: Create hashmap using two RDD's

Posted by Sri <ka...@gmail.com>.
Thanks Richard, will give it a try tomorrow...

Thanks
Sri

Sent from my iPhone


Re: Create hashmap using two RDD's

Posted by Richard Eggert <ri...@gmail.com>.
You should be able to achieve what you're looking for by using foldByKey to
find the latest record for each key. If you're relying on the order of
elements within the file to determine which ones are the "latest" (rather
than sorting by some field within the file itself), call zipWithIndex first
to give each element a numeric index that you can use for comparisons.

For example (type annotations are unnecessary but included for clarity):
val parsedRecords: RDD[(Key, Value)] = ???
// zipWithIndex assigns Long indices, so the paired value type is (Value, Long)
val indexedRecords: RDD[(Key, (Value, Long))] =
  parsedRecords.zipWithIndex.map { case ((k, v), n) => k -> (v, n) }
// Fold each key's elements, keeping whichever record has the larger index;
// null is the initial accumulator and loses to any real element
val latestRecords: RDD[(Key, Value)] =
  indexedRecords.foldByKey(null) { (a, b) =>
    (a, b) match {
      case (null, _) => b
      case ((_, an), (_, bn)) if an < bn => b
      case _ => a
    }
  }.mapValues { case (v, _) => v }

You can then write "latestRecords" out to a file however you like. Note
that I would recommend using string interpolation or the CSV output format
(for DataFrames) over that string replacement you are currently using to
format your output.
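
For instance, a minimal sketch of the interpolation approach, assuming a
made-up five-field value tuple (adjust the destructuring to your real
schema):

// Hypothetical record shape: (symbol, exchange, seq, bid, ask)
latestRecords.map { case (_, (symbol, exchange, seq, bid, ask)) =>
  s"$symbol|$exchange|$seq|$bid|$ask"  // pipe-delimited line via interpolation
}.saveAsTextFile("C:\\Users\\kalit_000\\Desktop\\mkdata\\latest")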



-- 
Rich

Re: Create hashmap using two RDD's

Posted by Kali <ka...@gmail.com>.
Hi Richard,

The requirement is to get the latest record for each key; I think a hash map is a good choice for this task.
The data comes from a third party and we are not sure which record is the latest, so a hash map was chosen.
If there is anything better than a hash map, please let me know.

Thanks
Sri

Sent from my iPad


Re: Create hashmap using two RDD's

Posted by Richard Eggert <ri...@gmail.com>.
Do you need the HashMap for anything else besides writing out to a file? If
not, there is really no need to create one at all.  You could just keep
everything as RDDs.
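
For example, a rough sketch of the pure-RDD version, assuming the pair RDD
from the post below (note that reduceByKey gives no ordering guarantee, so
"keep the second record" only approximates "latest"):

// Keep one record per key without ever collecting to the driver
val latestPerKey = QuoteRDD.reduceByKey((first, second) => second)
latestPerKey.values
  .map(_.toString)
  .saveAsTextFile("C:\\Users\\kalit_000\\Desktop\\mkdata\\latest")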

Re: Create hashmap using two RDD's

Posted by "Kali.tummala@gmail.com" <Ka...@gmail.com>.
Got it: I created the hashmap and saved it to a file. Please follow the steps below.

val QuoteRDD = quotefile.map(x => x.split("\\|"))
  .filter(line => line(0).contains("1017"))
  // Key: x(5)+x(4); value: a tuple whose last element depends on the
  // record type in x(15)
  .map(x => ((x(5) + x(4)), (x(5), x(4), x(1),
    if (x(15) == "B")
      (if (x(25) == "") x(9) else x(25),
       if (x(37) == "") x(11) else x(37))
    else if (x(15) == "C")
      (if (x(24) == "") x(9) else x(24),
       if (x(30) == "") x(11) else x(30))
    else if (x(15) == "A")
      (x(9), x(11))
    else
      ("", "")  // fall-through so unexpected record types still yield a value
  )))

// collect() brings the filtered data to the driver; fine while it stays small
val QuoteHashMap = QuoteRDD.collect().toMap
val test = QuoteHashMap.values.toSeq
// Strip the tuple parentheses for plain-text output
val test2 = sc.parallelize(test.map(x => x.toString.replace("(", "").replace(")", "")))
// saveAsTextFile creates a directory named test.txt containing part files
test2.saveAsTextFile("C:\\Users\\kalit_000\\Desktop\\mkdata\\test.txt")
test2.collect().foreach(println)





Re: Create hashmap using two RDD's

Posted by "Kali.tummala@gmail.com" <Ka...@gmail.com>.
Hi All, 

I changed my approach; now I am able to load the data into a Map and get
data out using the get method.

val QuoteRDD = quotefile.map(x => x.split("\\|"))
  .filter(line => line(0).contains("1017"))
  .map(x => ((x(5) + x(4)), (x(5), x(4), x(1),
    if (x(15) == "B")
      if (x(25) == "") x(9) else x(25),
    if (x(37) == "") x(11) else x(37),
    if (x(15) == "C")
      if (x(24) == "") x(9) else x(24),
    if (x(30) == "") x(11) else x(30),
    if (x(15) == "A")
      x(9),
    x(11)
  )))

    val QuoteHashMap=QuoteRDD.collect().toMap
    QuoteHashMap.get("CPHI080000173").foreach(println)
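
(Note: QuoteHashMap("key") would throw "key not found" for a missing key, as
in the first post; get returns an Option, and getOrElse supplies a default:)

// getOrElse substitutes a default instead of throwing on a missing key
QuoteHashMap.getOrElse("CPHI080000173", "missing")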

The problem now is how to save the value data from the hashmap to a file. I
need to iterate over the keys in the hashmap and save the values to a file:

for ((k, v) <- QuoteHashMap) QuoteHashMap.get(k)
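
A minimal driver-side sketch of that loop using java.io.PrintWriter (the
output path is illustrative):

import java.io.PrintWriter

// The map is already local after collect().toMap, so plain Scala I/O works
val writer = new PrintWriter("C:\\Users\\kalit_000\\Desktop\\mkdata\\values.txt")
try {
  for ((key, value) <- QuoteHashMap) writer.println(value.toString)
} finally {
  writer.close()
}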

Thanks
Sri