Posted to user@drill.apache.org by Khurram Faraaz <kf...@maprtech.com> on 2016/07/25 20:37:53 UTC

Overflow detection in Drill

Hi All,

As of today, Drill does not detect overflow and does not report to users
that an overflow occurred; instead we just return incorrect results. This
issue has been discussed (though not in detail) in the past.

It would be great if Drill detected overflow for integer types (INT,
BIGINT, etc.) the way other existing DBMSs do. Users do not want to see
incorrect results; an error informing them that an overflow occurred makes
more sense.
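For what it's worth, overflow-checked arithmetic is straightforward on the JVM. Here is a minimal sketch (not Drill code, just an illustration using java.lang.Math.addExact) of what detect-and-report could look like:

```java
public class CheckedSum {
    public static void main(String[] args) {
        long[] vals = {5000L, Long.MAX_VALUE};
        long sum = 0;
        try {
            for (long v : vals) {
                // addExact throws ArithmeticException instead of silently wrapping
                sum = Math.addExact(sum, v);
            }
            System.out.println("sum = " + sum);
        } catch (ArithmeticException e) {
            // Report the overflow, Postgres-style, instead of returning a wrapped value
            System.out.println("ERROR: bigint out of range");
        }
    }
}
```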

Here is an example of one such query that returns incorrect results
compared to Postgres. The difference in results is caused by the missing
overflow detection; col1 is of type BIGINT.

{noformat}
0: jdbc:drill:schema=dfs.tmp> SELECT col1, AVG(SUM(col1)) OVER ( PARTITION
BY col7 ORDER BY col0 ) FROM `allTypsUniq.parquet` GROUP BY col0,col1,col7;
+----------------------+--------------------------+
|         col1         |          EXPR$1          |
+----------------------+--------------------------+
| 5000                 | 5000.0                   |
| 9223372036854775807  | -4.6116860184273853E18   |
| 65534                | -3.0744573456182349E18   |
| -1                   | -2.30584300921367629E18  |
| 1                    | -1.84467440737094093E18  |
| 17                   | -1.53722867280911744E18  |
| 1000                 | -1.31762457669352909E18  |
| 200                  | -1.15292150460683802E18  |
| 4611686018427387903  | -5.1240955760303514E17   |
| 1001                 | -4.6116860184273152E17   |
| 30                   | -4.1924418349339232E17   |
| -65535               | -65535.0                 |
| 10000000             | 4967232.5                |
| 0                    | 3311488.3333333335       |
| 13                   | 2483619.5                |
| 23                   | 1986900.2                |
| 9999999              | 3322416.6666666665       |
| 197                  | 2847813.8571428573       |
| 9223372036854775806  | -1.1529215046043552E18   |
| 92233720385475807    | -1.01457092404992947E18  |
| 25                   | -9.1311383164493645E17   |
| 3000                 | -8.3010348331357837E17   |
+----------------------+--------------------------+
22 rows selected (0.46 seconds)
{noformat}
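As a side note (my check, not part of the original report): the negative averages above are exactly what two's-complement wraparound of a 64-bit running sum would produce. For the second row, the frame holds 5000 and Long.MAX_VALUE:

```java
public class WrapDemo {
    public static void main(String[] args) {
        // First two partition values from the table: 5000, then 2^63 - 1
        long sum = 5000L + Long.MAX_VALUE;   // silently wraps past Long.MAX_VALUE
        System.out.println(sum);             // -9223372036854770809
        System.out.println(sum / 2.0);       // -4.6116860184273853E18, the second Drill row
    }
}
```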

Results from Postgres

{noformat}
postgres=# SELECT col1, AVG(SUM(col1)) OVER ( PARTITION BY col7 ORDER BY
col0 ) FROM fewrwspqq_101 GROUP BY col0,col1,col7;
        col1         |          avg
---------------------+-----------------------
                5000 | 5000.0000000000000000
 9223372036854775807 |   4611686018427390404
               65534 |   3074457345618282114
                  -1 |   2305843009213711585
                   1 |   1844674407370969268
                  17 |   1537228672809141060
                1000 |   1317624576693549623
                 200 |   1152921504606855945
 4611686018427387903 |   1537228672809137273
                1001 |   1383505805528223646
                  30 |   1257732550480203317
              -65535 |   -65535.000000000000
            10000000 |  4967232.500000000000
                   0 |  3311488.333333333333
                  13 |  2483619.500000000000
                  23 |  1986900.200000000000
             9999999 |  3322416.666666666667
                 197 |  2847813.857142857143
 9223372036854775806 |   1152921504609338813
   92233720385475807 |   1035067306362242923
                  25 |    931560575726018634
                3000 |    846873250660017212
(22 rows)
{noformat}

Thanks,
Khurram

Re: Overflow detection in Drill

Posted by Ted Dunning <te...@gmail.com>.
This is a little bit off-topic, but there should be no risk of overflow in
a well-designed computation of the mean.

The simple and obvious algorithm where you add up all the numbers and
divide by the count is simply the wrong way to compute the mean if you want
numerical stability. The method sometimes attributed to Welford is
considerably better and avoids the risk of overflow.

This situation is even worse if you accumulate the variance (a.k.a. the
squared standard deviation) at the same time as the mean.

Here are specific examples framed as computations against an array of
values.

*WRONG:*

long count = 0;
double sum = 0;
for (int i = 0; i < x.length; i++) {
    count++;
    sum += x[i];
}
// One large accumulated sum: can overflow or lose precision before the division
double mean = sum / count;


*VERY WRONG:*

long count = 0;
double sum = 0;
double sum2 = 0;
for (int i = 0; i < x.length; i++) {
    count++;
    sum += x[i];
    sum2 += x[i] * x[i];
}
double mean = sum / count;
// Naive E[x^2] - E[x]^2: catastrophic cancellation when mean^2 >> variance
double variance = sum2 / count - mean * mean;


*RIGHT:*

long count = 0;
double mean = 0;
double variance = 0;
for (int i = 0; i < x.length; i++) {
    count++;
    double before = x[i] - mean;       // deviation from the old mean
    mean += before / count;            // incremental mean update
    double after = x[i] - mean;        // deviation from the new mean
    variance += (before * after - variance) / count;  // running population variance
}
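To make the contrast concrete (my example, not from the code above): averaging two Long.MAX_VALUE inputs by summing first wraps the 64-bit sum, while the incremental update never produces an intermediate larger than the largest input:

```java
public class MeanOverflowDemo {
    public static void main(String[] args) {
        long[] x = {Long.MAX_VALUE, Long.MAX_VALUE};

        // Sum-then-divide: the 64-bit sum wraps to -2, so the "mean" is -1
        long sum = 0;
        for (long v : x) {
            sum += v;
        }
        System.out.println("naive mean   = " + sum / x.length);  // -1

        // Welford-style running mean: intermediates stay within [min(x), max(x)]
        long count = 0;
        double mean = 0;
        for (long v : x) {
            count++;
            mean += ((double) v - mean) / count;
        }
        System.out.println("running mean = " + mean);  // 9.223372036854776E18
    }
}
```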



On Tue, Jul 26, 2016 at 4:39 AM, Khurram Faraaz <kf...@maprtech.com>
wrote:


Re: Overflow detection in Drill

Posted by Khurram Faraaz <kf...@maprtech.com>.
Another example where Drill does not detect or report overflow.

Results from Postgres

postgres=# SELECT col0, AVG(col0) OVER ( ORDER BY col0 + col1 ) avg_col0
FROM fewrwspqq_101 GROUP BY col0,col1;

ERROR:  bigint out of range

postgres=#

Results from Drill 1.8.0

0: jdbc:drill:schema=dfs.tmp> SELECT col0, AVG(col0) OVER ( ORDER BY col0 +
col1 ) avg_col0 FROM `allTypsUniq.parquet` GROUP BY col0,col1;
+-------------+-----------------------+
|    col0     |       avg_col0        |
+-------------+-----------------------+
| 23          | 23.0                  |
| -1          | 11.0                  |
| -65535      | -21837.666666666668   |
| 3           | -16377.5              |
| 4           | -13101.2              |
| 5           | -10916.833333333334   |
| 6           | -9356.42857142857     |
| 7           | -8186.0               |
| 8           | -7275.555555555556    |
| 13          | -6546.7               |
| 19          | -5949.818181818182    |
| 9           | -5453.25              |
| 1           | -5033.692307692308    |
| 65535       | 6.928571428571429     |
| 2           | 6.6                   |
| 10          | 6.8125                |
| 10000000    | 588241.7058823529     |
| 1073741823  | 6.020788511111111E7   |
| 2147483647  | 1.7006450415789473E8  |
| 109         | 1.615612844E8         |
| 29          | 1.538678912857143E8   |
| 0           | 1.4687389622727272E8  |
+-------------+-----------------------+
22 rows selected (0.341 seconds)



On Tue, Jul 26, 2016 at 2:07 AM, Khurram Faraaz <kf...@maprtech.com>
wrote:
