Posted to user@spark.apache.org by ayan guha <gu...@gmail.com> on 2016/11/12 00:10:20 UTC

Re: DataSet is not able to handle 50,000 columns to sum

You can explore grouping sets in SQL and write an aggregate function that does an
array-wise (element-wise) sum.

It will boil down to something like

Select attr1, attr2, ..., yourAgg(val)
From t
Group by attr1, attr2, ...
Grouping sets ((attr1, attr2), (attr1))
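
For concreteness, here is one possible shape of such an element-wise array-sum
aggregate as a Spark 2.x UserDefinedAggregateFunction. This is only a minimal
sketch: the class name ArraySumUDAF, the registered name array_sum, and the table
name t are illustrative, and it assumes the array length is fixed and known up
front.

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

// Illustrative element-wise array-sum aggregate (not a built-in Spark function).
// `length` is the fixed size of the double array (50,000 in the real data).
class ArraySumUDAF(length: Int) extends UserDefinedAggregateFunction {
  override def inputSchema: StructType  = StructType(StructField("arr", ArrayType(DoubleType)) :: Nil)
  override def bufferSchema: StructType = StructType(StructField("acc", ArrayType(DoubleType)) :: Nil)
  override def dataType: DataType       = ArrayType(DoubleType)
  override def deterministic: Boolean   = true

  // Start from a zero vector of the expected length.
  override def initialize(buffer: MutableAggregationBuffer): Unit =
    buffer(0) = Seq.fill(length)(0.0)

  // Add one input array into the accumulator, index by index.
  override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    if (!input.isNullAt(0)) {
      val acc = buffer.getSeq[Double](0)
      val arr = input.getSeq[Double](0)
      buffer(0) = acc.zip(arr).map { case (a, b) => a + b }
    }
  }

  // Combine two partial accumulators, index by index.
  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    val a = buffer1.getSeq[Double](0)
    val b = buffer2.getSeq[Double](0)
    buffer1(0) = a.zip(b).map { case (x, y) => x + y }
  }

  override def evaluate(buffer: Row): Any = buffer.getSeq[Double](0)
}

// Register it and use it in a GROUPING SETS query over the attribute columns:
spark.udf.register("array_sum", new ArraySumUDAF(50000))
val result = spark.sql("""
  SELECT Attribute_0, Attribute_1, array_sum(DoubleArray) AS summed
  FROM t
  GROUP BY Attribute_0, Attribute_1
  GROUPING SETS ((Attribute_0, Attribute_1), (Attribute_0))
""")

With 50,000 elements per row the zip/map in update allocates a new sequence per
input row, so an imperative in-place loop over a mutable array is a common
variation if performance matters.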
On 12 Nov 2016 04:57, "Anil Langote" <an...@gmail.com> wrote:

> Hi All,
>
>
>
> I have been working on a use case and couldn't come up with a better
> solution. I have seen you are very active on the Spark user list, so please
> share your thoughts on the implementation. Below is the requirement.
>
>
>
> I have tried using a Dataset by splitting the double array into separate
> columns, but this fails when the array size grows. When I declare the column
> as an array-of-double type in the schema, Spark doesn't let me sum it,
> because aggregation is supported only on numeric types. If I instead store
> one file per combination in Parquet, there will be too many Parquet files.
>
>
>
> *Input:* The input file will look like the table below; in the real data
> there will be *20* attributes & the double array will have *50,000* elements.
>
>
>
>
>
> Attribute_0 | Attribute_1 | Attribute_2 | Attribute_3 | DoubleArray
> 5 | 3 | 5 | 3 | 0.2938933463658645  0.0437040427073041  0.23002681025029648  0.18003221216680454
> 3 | 2 | 1 | 3 | 0.5353599620508771  0.026777650111232787  0.31473082754161674  0.2647786522276575
> 5 | 3 | 5 | 2 | 0.8803063581705307  0.8101324740101096  0.48523937757683544  0.5897714618376072
> 3 | 2 | 1 | 3 | 0.33960064683141955  0.46537001358164043  0.543428826489435  0.42653939565053034
> 2 | 2 | 0 | 5 | 0.5108235777360906  0.4368119043922922  0.8651556676944931  0.7451477943975504
>
>
>
> Now, below are all the possible attribute combinations for the above data
> set:
>
>
>
> 1.      Attribute_0, Attribute_1
>
> 2.      Attribute_0, Attribute_2
>
> 3.      Attribute_0, Attribute_3
>
> 4.      Attribute_1, Attribute_2
>
> 5.      Attribute_2, Attribute_3
>
> 6.      Attribute_1, Attribute_3
>
> 7.      Attribute_0, Attribute_1, Attribute_2
>
> 8.      Attribute_0, Attribute_1, Attribute_3
>
> 9.      Attribute_0, Attribute_2, Attribute_3
>
> 10.  Attribute_1, Attribute_2, Attribute_3
>
> 11.  Attribute_0, Attribute_1, Attribute_2, Attribute_3
>
>
>
> Now we have to process all these combinations on the input data, preferably
> in parallel, to get good performance.
>
>
>
> *Attribute_0, Attribute_1*
>
>
>
> In this iteration the other attributes (*Attribute_2, Attribute_3*) are not
> required; all we need are the Attribute_0, Attribute_1 & double array
> columns. If you look at the data, there are two possible key combinations:
> 5_3 and 3_2. We have to pick only the keys that occur at least twice; in
> real data we will get thousands of them.
>
>
>
>
>
> Attribute_0 | Attribute_1 | Attribute_2 | Attribute_3 | DoubleArray
> 5 | 3 | 5 | 3 | 0.2938933463658645  0.0437040427073041  0.23002681025029648  0.18003221216680454
> 3 | 2 | 1 | 3 | 0.5353599620508771  0.026777650111232787  0.31473082754161674  0.2647786522276575
> 5 | 3 | 5 | 2 | 0.8803063581705307  0.8101324740101096  0.48523937757683544  0.5897714618376072
> 3 | 2 | 1 | 3 | 0.33960064683141955  0.46537001358164043  0.543428826489435  0.42653939565053034
> 2 | 2 | 0 | 5 | 0.5108235777360906  0.4368119043922922  0.8651556676944931  0.7451477943975504
>
>
>
> When we do a groupBy on the above dataset with columns *Attribute_0,
> Attribute_1*, we get two records, with keys 5_3 & 3_2, and each key has two
> double arrays.
>
>
>
> 5_3 ==> 0.2938933463658645  0.0437040427073041  0.23002681025029648
> 0.18003221216680454 & 0.8803063581705307  0.8101324740101096
> 0.48523937757683544  0.5897714618376072
>
>
>
> 3_2 ==> 0.5353599620508771  0.026777650111232787  0.31473082754161674
> 0.2647786522276575 & 0.33960064683141955  0.46537001358164043
> 0.543428826489435  0.42653939565053034
>
>
>
> Now we have to add these double arrays index-wise and produce a single array:
>
>
>
> 5_3 ==>  [1.1741997045363952, 0.8538365167174137, 0.7152661878271319,
> 0.7698036740044117]
>
> 3_2 ==> [0.8749606088822967, 0.4921476636928732, 0.8581596540310518,
> 0.6913180478781878]
>
>
>
> After adding, we have to compute average, min, max, etc. on these vectors and
> store the results against the keys (a minimal sketch of this step is included
> after the quoted message).
>
>
>
> The same process will be repeated for the next combinations.
>
>
>
>
>
>
>
> Thank you
>
> Anil Langote
>
> +1-425-633-9747
>
>
>
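
For clarity, here is a minimal Scala sketch of the index-wise addition and the
follow-up statistics described in the quoted message. It is only an illustration
added here; the variable names are not from the original thread.

// The two arrays grouped under key 5_3 in the example above.
val a = Array(0.2938933463658645, 0.0437040427073041, 0.23002681025029648, 0.18003221216680454)
val b = Array(0.8803063581705307, 0.8101324740101096, 0.48523937757683544, 0.5897714618376072)

// Index-wise sum: element i of the result is a(i) + b(i).
val summed = a.zip(b).map { case (x, y) => x + y }
// summed is approximately [1.1741997045363952, 0.8538365167174137, 0.7152661878271319, 0.7698036740044117]

// Follow-up statistics computed on the summed vector and stored against the key.
val avg = summed.sum / summed.length
val mn  = summed.min
val mx  = summed.max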

Re: DataSet is not able to handle 50,000 columns to sum

Posted by Anil Langote <an...@gmail.com>.
All right, thanks for the inputs. Is there any way Spark can process all combinations in parallel in one job?

Is it OK to load the input CSV file into a DataFrame, use flatMap to create key pairs, and then use reduceByKey to sum the double arrays? I believe that will work the same as the aggregate function you are suggesting.
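
A minimal sketch of that flatMap + reduceByKey idea, assuming the file layout
matches the example above (comma-separated attributes followed by a
space-separated double array) and that the attribute-index combinations are known
up front. Names such as combinations, parsed, and input.csv are illustrative, and
sc is the usual SparkContext; the same shape applies if the file is first loaded
as a DataFrame and converted with .rdd.

// Parse each line into (attribute values, double array).
val parsed = sc.textFile("input.csv").map { line =>
  val parts  = line.split(",")
  val attrs  = parts.init                                 // 20 attributes in the real data
  val values = parts.last.trim.split("\\s+").map(_.toDouble)
  (attrs, values)
}

// Attribute-index combinations to evaluate (illustrative subset).
val combinations: Seq[Seq[Int]] = Seq(Seq(0, 1), Seq(0, 2), Seq(1, 2, 3))

// Emit one (combination + key, array) pair per row per combination,
// then sum the arrays element-wise per key in a single reduceByKey pass.
val summed = parsed.flatMap { case (attrs, values) =>
  combinations.map { combo =>
    val key = combo.mkString(",") + ":" + combo.map(i => attrs(i)).mkString("_")
    (key, values)
  }
}.reduceByKey { (a, b) =>
  a.zip(b).map { case (x, y) => x + y }       // element-wise sum of two arrays
}

Because every combination is encoded into the key, all combinations are aggregated
together in one job, which is essentially the parallel behaviour asked about above;
keys that occur only once could be filtered out afterwards by keeping a count
alongside the array.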

Best Regards,
Anil Langote
+1-425-633-9747

> On Nov 11, 2016, at 7:10 PM, ayan guha <gu...@gmail.com> wrote:
> 
> You can explore grouping sets in SQL and write an aggregate function that does an array-wise (element-wise) sum.
> 
> It will boil down to something like
> 
> Select attr1, attr2, ..., yourAgg(val)
> From t
> Group by attr1, attr2, ...
> Grouping sets ((attr1, attr2), (attr1))
> 