Posted to user@spark.apache.org by VG <vl...@gmail.com> on 2016/07/22 18:21:56 UTC

How to search on a Dataset / RDD

Any suggestions here, please?

I basically need to be able to look up *name -> index* and *index -> name*
in the code.

-VG

On Fri, Jul 22, 2016 at 6:40 PM, VG <vl...@gmail.com> wrote:

> Hi All,
>
> I am really confused about how to proceed further. Please help.
>
> I have a dataset created as follows:
> Dataset<Row> business = sqlContext.sql("SELECT bid, name FROM business");
>
> Now I need to map each name to a unique index, and I did the following:
> JavaPairRDD<Row, Long> indexedBId = business.javaRDD().zipWithIndex();
>
> In a later part of the code I need to change a data structure and replace
> the name with the index value generated above.
> I am unable to figure out how to do a lookup here.
>
> Please suggest an approach.
>
> If there is a better way to do this, please suggest that.
>
> Regards
> VG
>
>

Re: How to search on a Dataset / RDD

Posted by Pedro Rodriguez <sk...@gmail.com>.
You might look at monotonically_increasing_id() here
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions
instead of converting it to an RDD, since you pay a performance penalty for
that conversion.
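
One caveat worth knowing before swapping zipWithIndex for
monotonically_increasing_id(): the Spark documentation describes the generated
IDs as unique and increasing but not consecutive, with the partition index in
the upper 31 bits and the record number within the partition in the lower 33
bits. The plain-Java sketch below only mimics that documented layout to show
what the values look like; the class MonotonicIdSketch and the helper idFor are
hypothetical names, not part of the Spark API.

```java
// Sketch of the ID layout documented for monotonically_increasing_id():
// partition index in the upper 31 bits, record number in the lower 33 bits.
final class MonotonicIdSketch {
    // Hypothetical helper, not a Spark function.
    static long idFor(int partitionIndex, long recordNumber) {
        return ((long) partitionIndex << 33) + recordNumber;
    }

    public static void main(String[] args) {
        // Two partitions holding three rows each.
        for (int partition = 0; partition < 2; partition++) {
            for (long record = 0; record < 3; record++) {
                System.out.println("partition " + partition
                        + ", record " + record
                        + " -> id " + idFor(partition, record));
            }
        }
    }
}
```

So the IDs work as unique keys, but they should not be treated as dense
0..n-1 positions the way zipWithIndex values can be.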

If you want to change the name, you can do something like this (in Scala,
since I am not familiar with the Java API, but it should be similar in Java):

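import org.apache.spark.sql.functions.monotonically_increasing_id
import sqlContext.implicits._ // enables the $"id" column syntax

val df = sqlContext.sql("select bid, name from business")
  .withColumn("id", monotonically_increasing_id())
// some steps later on: returns a new DataFrame whose name column holds the id
df.withColumn("name", $"id")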

I am not 100% sure what you mean by updating the data structure; I am guessing
you mean replacing the name column with the id column? Note that the
withColumn call on the last line uses $"id", which in Scala converts to a
Column. In Java maybe it's something like new Column("id"), not sure.
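
For the name -> index and index -> name lookups asked about in the original
question, one option, assuming the distinct names are few enough to collect
onto the driver, is a pair of in-memory structures. The sketch below is plain
Java with no Spark dependency; the class NameIndexLookup, its methods, and the
sample names are all made up for illustration, and the input list stands in
for whatever collecting the name column would return.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: a two-way lookup between names and indices.
final class NameIndexLookup {
    private final Map<String, Long> indexByName = new HashMap<>();
    private final List<String> nameByIndex = new ArrayList<>();

    // Assigns consecutive indices to distinct names in first-seen order.
    // (zipWithIndex on an RDD would index every row, duplicates included.)
    NameIndexLookup(List<String> names) {
        for (String name : names) {
            if (!indexByName.containsKey(name)) {
                indexByName.put(name, (long) nameByIndex.size());
                nameByIndex.add(name);
            }
        }
    }

    // name -> index; returns -1 when the name is unknown.
    long indexOf(String name) {
        Long index = indexByName.get(name);
        return index == null ? -1L : index;
    }

    // index -> name; returns null when the index is out of range.
    String nameOf(long index) {
        boolean inRange = index >= 0 && index < nameByIndex.size();
        return inRange ? nameByIndex.get((int) index) : null;
    }

    public static void main(String[] args) {
        // Sample data standing in for the collected name column.
        NameIndexLookup lookup =
                new NameIndexLookup(Arrays.asList("Cafe Uno", "Book Nook", "Cafe Uno"));
        System.out.println(lookup.indexOf("Book Nook")); // prints 1
        System.out.println(lookup.nameOf(0));            // prints Cafe Uno
    }
}
```

If the names do not fit in driver memory, keeping both columns in one
DataFrame and joining against it is probably the better route.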

Pedro

