Posted to user@spark.apache.org by Krishna Sankar <ks...@gmail.com> on 2014/06/14 05:52:36 UTC
Multi-dimensional Uniques over large dataset
Hi,
Would appreciate insights and wisdom on a problem we are working on:
1. Context:
- Given a CSV file like:
- d1,c1,a1
- d1,c1,a2
- d1,c2,a1
- d1,c1,a1
- d2,c1,a3
- d2,c2,a1
- d3,c1,a1
- d3,c3,a1
- d3,c2,a1
- d3,c3,a2
- d5,c1,a3
- d5,c2,a2
- d5,c3,a2
- Want the total and unique counts of the d_ values for each c_ and a_
dimension value, and for each c_-a_ combination:

           Tot  Unique
   c1       6     4
   c2       4     4
   c3       3     2
   a1       7     3
   a2       4     3
   a3       2     2
   c1-a1   ...   ...
   c1-a2   ...   ...
   c1-a3   ...   ...
   c2-a1   ...   ...
   c2-a2   ...   ...
   ...
   c3-a3   ...   ...
- Obviously the real data has millions of records and more
attributes/dimensions, so scalability is key.
2. We think Spark is a good stack for this problem, and have a few
questions:
3. From a Spark perspective, what are the best transformations to use,
and what should we watch out for?
4. Is a PairRDD the best data representation? groupByKey et al. are only
available on PairRDDs.
5. On a pragmatic level, file.map().map() results in a plain RDD. How do I
turn it into a PairRDD? (A sketch follows this list.)
1. .map(fields => (fields(1), fields(0)) - results in Unit
2. .map(fields => fields(1) -> fields(0)) also does not work
3. Neither results in a PairRDD
4. I must be missing something fundamental.
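For concreteness, a minimal sketch of the pipeline described above (Spark
1.x Scala; the file path and local master are illustrative, and the
(d, c, a) field order is taken from the sample). The SparkContext._ import
is what makes the pair operations available on a tuple RDD, so a plain
map/flatMap to tuples is enough. As an aside, the first attempt under #5
is missing a closing parenthesis, which would explain the odd REPL result.

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._  // implicit RDD[(K, V)] -> PairRDDFunctions (Spark 1.x)

val sc    = new SparkContext("local", "uniques")  // illustrative master/app name
val lines = sc.textFile("data.csv")               // illustrative path

// Emit one (key, d) pair per c_ value, per a_ value, and per c_-a_
// combination, so a single shuffle covers every grouping in the table
// above. Assumes well-formed three-field rows.
val pairs = lines.map(_.split(",")).flatMap { case Array(d, c, a) =>
  Seq((c, d), (a, d), (c + "-" + a, d))
}
// pairs is an RDD[(String, String)]; groupByKey, reduceByKey, etc. are
// now available directly -- no manual PairRDDFunctions wrapping needed.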
Cheers & Have a nice weekend
<k/>
Re: Multi-dimensional Uniques over large dataset
Posted by Krishna Sankar <ks...@gmail.com>.
And got a first cut:
val res = pairs.groupByKey().map((x) => (x._1, x._2.size, x._2.toSet.size))
gives the totals and uniques.
The question: is it scalable and efficient? Would appreciate insights.
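One caveat with groupByKey is that it ships every raw value for a key to a
single reducer before anything is counted. A sketch of an alternative (not
from the thread; same pairs RDD as above, names illustrative) that combines
map-side, so only partial (count, distinct-set) aggregates cross the network:

// For each key keep a running (total, set-of-distinct-d) pair; reduceByKey
// applies this merge map-side before the shuffle.
val stats = pairs
  .mapValues(d => (1L, Set(d)))
  .reduceByKey { case ((n1, s1), (n2, s2)) => (n1 + n2, s1 ++ s2) }
  .mapValues { case (total, ds) => (total, ds.size) }  // (key, (total, unique))

If the distinct sets themselves grow large, countApproxDistinctByKey on
PairRDDFunctions (HyperLogLog-based) trades a small, configurable error
bound for bounded memory per key.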
Cheers
<k/>
On Fri, Jun 13, 2014 at 10:51 PM, Krishna Sankar <ks...@gmail.com>
wrote:
> Answered one of my questions (#5): val pairs = new
> PairRDDFunctions(<RDD>) works fine locally. Now I can do groupByKey et al.
> I am not sure if it is scalable to millions of records and memory efficient.
> Cheers
> <k/>
Re: Multi-dimensional Uniques over large dataset
Posted by Krishna Sankar <ks...@gmail.com>.
Answered one of my questions (#5): val pairs = new PairRDDFunctions(<RDD>)
works fine locally. Now I can do groupByKey et al. I am not sure if it is
scalable to millions of records and memory efficient.
Cheers
<k/>
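Constructing PairRDDFunctions by hand works, but the intended route in
Spark 1.x is the implicit conversion pulled in by the SparkContext import.
A tiny sketch (data and names illustrative):

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._  // implicit RDD[(K, V)] -> PairRDDFunctions

val sc  = new SparkContext("local", "pairs-demo")
val rdd = sc.parallelize(Seq(("c1", "d1"), ("c1", "d2"), ("c2", "d1")))
rdd.groupByKey()  // compiles without new PairRDDFunctions(...)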
On Fri, Jun 13, 2014 at 8:52 PM, Krishna Sankar <ks...@gmail.com> wrote:
> Hi,
> Would appreciate insights and wisdom on a problem we are working on:
> [...]