Posted to user@spark.apache.org by sakag <le...@gmail.com> on 2020/03/11 15:59:21 UTC

Time-based frequency table at scale

Hi all, 
 
We have a rather interesting use case, and are struggling to come up with an
approach that scales. Reaching out to seek your expert opinion/feedback and
tips. 
 
What we are trying to do is find the count of numerical ids over a
sliding time window, where each of our data records has a timestamp and a
set of numerical ids in the format below.
 
timestamp | ids
1         | [1, 2, 3, 8]
1         | [1, 2]
2         | [1, 2, 3, 4]
2         | [1, 10]
 
What we are looking to get as output is:
 
timestamp | id_count_map
1 | {1: 2, 2: 2, 3: 1, 4: 0, 5: 0, 6: 0, 7: 0, 8: 1}
2 | {1: 2, 2: 1, 3: 1, 4: 1, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0, 10: 1}
 
This gives us the frequency of occurrence of these ids over time periods.
Please note that the expected output is in a dense format, i.e. ids that do
not occur in a time period still appear with an explicit count of 0.
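
In DataFrame terms, a toy reproduction of the input above would look
roughly like this (the df variable is reused in the sketches further down):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, [1, 2, 3, 8]), (1, [1, 2]), (2, [1, 2, 3, 4]), (2, [1, 10])],
    ["timestamp", "ids"])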
 
However, we are running into scale issues with data that has the following
characteristics:
 
- 500 million records (~100 GB total)
- Each record can have 500 elements in the ids column
- Max id value (i.e. the length of id_count_map) is 750K
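(As a rough size estimate: with 8-byte doubles, a single dense vector of
length 750K is about 6 MB, so materializing one dense vector per record, as
approach 1 below does, touches on the order of petabytes across 500 million
records; even one dense vector per distinct timestamp is still ~6 MB per
group.)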
 
We have tried the approaches below to achieve this:
1) Expanding ids to a dense, frequency-based vector and then doing a
row-wise sum over a Window partitioned by timestamp
2) Converting ids into a SparseVector and computing the L1 norm (using
Summarizer) over a Window partitioned by timestamp - see the sketch after
this list
3) GroupBy/aggregating ids by timestamp, converting to a sparse,
frequency-based vector using collections.Counter, and expanding to a dense
format
4) GroupBy/aggregating ids by timestamp, converting to a sparse,
frequency-based vector using CountVectorizer, and then expanding to a dense
format
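
A minimal sketch of how we build the per-record SparseVector and aggregate
it in approach 2 - simplified here to a groupBy instead of a Window, with
MAX_ID and to_sparse_vec as illustrative names, and assuming Spark 2.4+ for
pyspark.ml.stat.Summarizer:

from collections import Counter
from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.ml.stat import Summarizer
from pyspark.sql import functions as F

MAX_ID = 750_000  # largest id value we expect to see

@F.udf(returnType=VectorUDT())
def to_sparse_vec(ids):
    # one frequency-based SparseVector per record: index = id, value = count
    return Vectors.sparse(MAX_ID + 1, dict(Counter(ids)))

# normL1 of non-negative count vectors is just the element-wise sum of counts
aggregated = (df
    .withColumn("vec", to_sparse_vec("ids"))
    .groupBy("timestamp")
    .agg(Summarizer.normL1(F.col("vec")).alias("id_counts")))

Even in this form, every aggregated row carries a ~750K-element dense
vector, which is where things start to hurt.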
 
Any other approaches we could try?
 
Thanks!
Sakshi
 





Re: Time-based frequency table at scale

Posted by Nicolas Paris <ni...@riseup.net>.
Hi,

Did you try exploding the arrays, then doing the aggregation/count, and at
the end applying a UDF to add the 0 values? In my experience, working on
arrays directly is usually a bad idea.
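
Something like this rough sketch (assuming Spark 2.4+ for map_from_entries;
the column names and the densify helper are only illustrative):

from pyspark.sql import functions as F
from pyspark.sql.types import LongType, MapType

# explode the array so each (timestamp, id) pair becomes its own row, then count
exploded_counts = (df
    .select("timestamp", F.explode("ids").alias("id"))
    .groupBy("timestamp", "id")
    .count())

# gather the sparse counts of each timestamp into a single map column
sparse_maps = (exploded_counts
    .groupBy("timestamp")
    .agg(F.map_from_entries(F.collect_list(F.struct("id", "count")))
          .alias("sparse_map")))

# add explicit 0 entries for the missing ids (here: up to the max id per timestamp)
@F.udf(returnType=MapType(LongType(), LongType()))
def densify(sparse_map):
    return {i: sparse_map.get(i, 0) for i in range(1, max(sparse_map) + 1)}

result = sparse_maps.withColumn("id_count_map", densify("sparse_map"))

That keeps everything sparse until the very last step, and the densification
then runs once per timestamp rather than once per record.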



-- 
nicolas paris



Re: Time-based frequency table at scale

Posted by Enrico Minack <ma...@Enrico.Minack.dev>.
An interesting puzzle indeed.

What is your measure of "that scales"? Does not fail, does not spill,
does not need a huge amount of memory / disk, is O(N), processes X
records per second and per core?

Enrico


