Posted to dev@systemds.apache.org by GitBox <gi...@apache.org> on 2022/11/13 02:19:59 UTC

[GitHub] [systemds] BACtaki opened a new pull request, #1727: [SYSTEMDS-3458] Adding Spark backend support for countDistinct() builtin

BACtaki opened a new pull request, #1727:
URL: https://github.com/apache/systemds/pull/1727

   ## Description
   This patch adds support for running the countDistinct() builtin on the Spark backend. The implementation adds a new CountDistinctFunctionSketch to the sketches library. The sketch is based on an optimized HashMap lookup over the bit representation of each input value (see the illustrative Java sketch below):
       Let [sign (1) | exponent (11) | fraction (52)] represent the bits of a double value
   
       - Each double is split into 2 parts:
        - PartA: [sign (1) | exponent (11)]
        - PartB: [fraction (52)]
       - PartA is used as the key into the HashMap, and PartB is stored in the Set<Long> value
       - Many values in blkIn may have the same exponent, so we need to store multiple PartB values for any given PartA key
       - The HashMap is then serialized into the sketch like so:
           row_i: [exponent_i, N_i, fraction_i0, fraction_i1, .., fraction_iN, 0, .., 0]
   
            - exponent_i: the short value holding the sign and exponent bits (PartA) of the original double input
            - N_i: the number of fraction values stored for this key; storing the count up front lets us avoid dealing with jagged matrices
       
   NB: Although countDistinct() does not strictly require keeping track of the individual input elements, the sketch must do so in order to support the union() op for the MULTI_BLOCK case (see notes below)
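   For illustration, a minimal, self-contained Java sketch of the splitting and serialization scheme described above. The class and method names here are hypothetical; the actual implementation in this patch is CountDistinctFunctionSketch.
```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class BitSplitSketchDemo {
    // PartA [sign (1) | exponent (11)] -> set of distinct PartB [fraction (52)] values
    private final Map<Short, Set<Long>> sketch = new HashMap<>();

    public void add(double value) {
        long bits = Double.doubleToLongBits(value);
        short partA = (short) (bits >>> 52);   // top 12 bits: sign + exponent
        long partB = bits & ((1L << 52) - 1);  // low 52 bits: fraction
        sketch.computeIfAbsent(partA, k -> new HashSet<>()).add(partB);
    }

    public long countDistinct() {
        return sketch.values().stream().mapToLong(Set::size).sum();
    }

    // Serialize into the row layout described above:
    // row_i: [exponent_i, N_i, fraction_i0, ..., fraction_i(N_i-1), 0, ..., 0]
    // Fractions are < 2^52, so they are exactly representable as doubles.
    public double[][] serialize() {
        int maxN = sketch.values().stream().mapToInt(Set::size).max().orElse(0);
        double[][] rows = new double[sketch.size()][2 + maxN];
        int i = 0;
        for (Map.Entry<Short, Set<Long>> e : sketch.entrySet()) {
            rows[i][0] = e.getKey();          // PartA: sign + exponent bits
            rows[i][1] = e.getValue().size(); // N_i: fraction count for this key
            int j = 2;
            for (long fraction : e.getValue())
                rows[i][j++] = fraction;      // trailing cells stay 0
            i++;
        }
        return rows;
    }
}
```
   Note that Double.doubleToLongBits canonicalizes NaN values, which conveniently collapses all NaNs into a single distinct value.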
   
   In the worst case, blkIn can have 1000 x 1000 = 1e6 unique values. Therefore, total memory is bounded above by:
```
    Max of
        (1e6 * 2) + (1e6 * 8) bytes   -> 1e6 keys, each with a single fraction value
        and 2 + (1e6 * 8) bytes       -> a single key, with 1e6 unique fraction parts
    = (1e6 * 2) + (1e6 * 8) = 1e6 * 10 bytes = 1e7 bytes = 10 MB
```
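   The same bound as a quick sanity check in Java (variable names here are illustrative):
```java
// Worst case: a 1000 x 1000 block in which every value is distinct (1e6 entries).
long n = 1_000_000L;
long manyKeys = n * 2 + n * 8; // 1e6 short keys, each with one 8-byte fraction
long oneKey   = 2 + n * 8;     // one short key holding 1e6 8-byte fractions
long bound    = Math.max(manyKeys, oneKey); // 10_000_000 bytes = 10 MB
```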
   
   ## Notes to Reviewers
    - The current implementation only supports the SINGLE_BLOCK case, i.e. input matrices of at most 1000 x 1000; larger inputs (along either dimension) will throw a `NotImplementedException`. MULTI_BLOCK aggregation support will be added in a future patch.
    - This patch adds support for only the default RowCol direction; support for the Row and Col directions will be added in a future patch.
   
   ## Tests
    - [X] Unit tests
    - [ ] Integration tests
    - [ ] N/A
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@systemds.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [systemds] Baunsgaard closed pull request #1727: [SYSTEMDS-3458] Adding Spark backend support for countDistinct() builtin

Posted by GitBox <gi...@apache.org>.
Baunsgaard closed pull request #1727: [SYSTEMDS-3458] Adding Spark backend support for countDistinct() builtin
URL: https://github.com/apache/systemds/pull/1727

