Posted to user@spark.apache.org by Krishna Sankar <ks...@gmail.com> on 2014/06/14 05:52:36 UTC
Multi-dimensional Uniques over large dataset
Hi,
Would appreciate insights and wisdom on a problem we are working on:
1. Context:
- Given a CSV file like:
- d1,c1,a1
- d1,c1,a2
- d1,c2,a1
- d1,c1,a1
- d2,c1,a3
- d2,c2,a1
- d3,c1,a1
- d3,c3,a1
- d3,c2,a1
- d3,c3,a2
- d5,c1,a3
- d5,c2,a2
- d5,c3,a2
- Want the total and unique counts of the d_ values for each c_ and a_
dimension value, and for each c_-a_ combination:

           Tot  Unique
   c1       6     4
   c2       4     4
   c3       3     2
   a1       7     3
   a2       4     3
   a3       2     2
   c1-a1   ...   ...
   c1-a2   ...   ...
   c1-a3   ...   ...
   c2-a1   ...   ...
   c2-a2   ...   ...
   ...
   c3-a3   ...   ...
- Obviously the real data has millions of records and more
attributes/dimensions, so scalability is key.
2. We think Spark is a good stack for this problem, and have a few
questions:
3. From a Spark perspective, what are the best transformations to use,
and what should we watch out for?
4. Is a PairRDD the best data representation? groupByKey et al. are only
available on PairRDDs.
5. On a pragmatic level, file.map().map() results in a plain RDD. How do I
turn it into a PairRDD? (A sketch follows this list.)
1. .map(fields => (fields(1), fields(0)) - results in Unit
2. .map(fields => fields(1) -> fields(0)) also does not work
3. Neither results in a PairRDD
4. I must be missing something fundamental.
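For concreteness, a minimal sketch of the pipeline described above (Spark
1.x Scala; the file path and local master are illustrative, and the
(d, c, a) field order is taken from the sample). The SparkContext._ import
is what makes the pair operations available on a tuple RDD, so a plain
map/flatMap to tuples is enough. As an aside, the first attempt under #5
is missing a closing parenthesis, which would explain the odd REPL result.

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._  // implicit RDD[(K, V)] -> PairRDDFunctions (Spark 1.x)

val sc    = new SparkContext("local", "uniques")  // illustrative master/app name
val lines = sc.textFile("data.csv")               // illustrative path

// Emit one (key, d) pair per c_ value, per a_ value, and per c_-a_
// combination, so a single shuffle covers every grouping in the table
// above. Assumes well-formed three-field rows.
val pairs = lines.map(_.split(",")).flatMap { case Array(d, c, a) =>
  Seq((c, d), (a, d), (c + "-" + a, d))
}
// pairs is an RDD[(String, String)]; groupByKey, reduceByKey, etc. are
// now available directly -- no manual PairRDDFunctions wrapping needed.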
Cheers & Have a nice weekend
<k/>
Re: Multi-dimensional Uniques over large dataset
Posted by Krishna Sankar <ks...@gmail.com>.
And got a first cut:
val res = pairs.groupByKey().map((x) => (x._1, x._2.size, x._2.toSet.size))
gives the totals and uniques.
The question: is it scalable and efficient? Would appreciate insights.
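One caveat with groupByKey is that it ships every raw value for a key to a
single reducer before anything is counted. A sketch of an alternative (not
from the thread; same pairs RDD as above, names illustrative) that combines
map-side, so only partial (count, distinct-set) aggregates cross the network:

// For each key keep a running (total, set-of-distinct-d) pair; reduceByKey
// applies this merge map-side before the shuffle.
val stats = pairs
  .mapValues(d => (1L, Set(d)))
  .reduceByKey { case ((n1, s1), (n2, s2)) => (n1 + n2, s1 ++ s2) }
  .mapValues { case (total, ds) => (total, ds.size) }  // (key, (total, unique))

If the distinct sets themselves grow large, countApproxDistinctByKey on
PairRDDFunctions (HyperLogLog-based) trades a small, configurable error
bound for bounded memory per key.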
Cheers
<k/>
On Fri, Jun 13, 2014 at 10:51 PM, Krishna Sankar <ks...@gmail.com>
wrote:
> Answered one of my questions (#5): val pairs = new
> PairRDDFunctions(<RDD>) works fine locally. Now I can do groupByKey et al.
> I am not sure if it is scalable to millions of records and memory efficient.
> Cheers
> <k/>
Re: Multi-dimensional Uniques over large dataset
Posted by Krishna Sankar <ks...@gmail.com>.
Answered one of my questions (#5): val pairs = new PairRDDFunctions(<RDD>)
works fine locally. Now I can do groupByKey et al. I am not sure if it is
scalable to millions of records and memory efficient.
Cheers
<k/>
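Constructing PairRDDFunctions by hand works, but the intended route in
Spark 1.x is the implicit conversion pulled in by the SparkContext import.
A tiny sketch (data and names illustrative):

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._  // implicit RDD[(K, V)] -> PairRDDFunctions

val sc  = new SparkContext("local", "pairs-demo")
val rdd = sc.parallelize(Seq(("c1", "d1"), ("c1", "d2"), ("c2", "d1")))
rdd.groupByKey()  // compiles without new PairRDDFunctions(...)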
On Fri, Jun 13, 2014 at 8:52 PM, Krishna Sankar <ks...@gmail.com> wrote:
> Hi,
> Would appreciate insights and wisdom on a problem we are working on:
> [...]