Posted to solr-user@lucene.apache.org by Oleksandr Chornyi <ol...@gmail.com> on 2019/06/07 11:50:29 UTC

Issues with calculating metrics and sorting on a float field in a stream

Hi guys!

I bumped into a couple of issues when trying to sort a stream or calculate
metrics on a Float field which contains values without a decimal part
(e.g. 1.0, 0.0, etc.).

1. Issues with sorting. Consider this expression:

> sort(
>     list(
>        tuple(a=val(1.0)),
>        tuple(a=val(2.0)),
>        tuple(a=val(3.0))
>     ),
>     by="a desc"
> )

It executes the sort just fine and returns:

> "docs": [
>   {"a": 3},
>   {"a": 2},
>   {"a": 1}
> ]

The only minor issue at this point is that the float values lost their
original type and came back as integers; I'll get back to this later.
Now let's do a simple calculation over the same stream and try to sort it:

> sort(
>     select(
>         list(
>            tuple(a=val(1.0)),
>            tuple(a=val(2.0)),
>            tuple(a=val(3.0))
>         ),
>         div(a, 2) as a
>     ),
>     by="a desc"
> )

This expression returns "EXCEPTION": "java.lang.Long cannot be cast to
java.lang.Double". This happens because the div() function returns
different data types for different tuples. If you execute just the select
expression:

> select(
>     list(
>         tuple(a=val(1.0)),
>         tuple(a=val(2.0)),
>         tuple(a=val(3.0))
>     ),
>     div(a, 2) as a
> )

It returns tuples where the field "a" has mixed Long and Double data
types:

> "docs": [
>   {"a": 0.5},
>   {"a": 1},
>   {"a": 1.5}
> ]

This is why sort stumbles over it.
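
To see the same failure outside of Solr, here is a minimal standalone Java
sketch (my own illustration, not the actual Solr comparator code) that sorts
the mixed Long/Double values from the select() output above by casting every
value to Double:

> import java.util.Arrays;
> import java.util.List;
>
> public class MixedSortSketch {
>     public static void main(String[] args) {
>         // The same mixed values select() produced: 0.5 (Double), 1 (Long), 1.5 (Double)
>         List<Number> values = Arrays.asList(0.5, 1L, 1.5);
>         // Sort "a desc" by casting every value to Double
>         values.sort((x, y) -> ((Double) y).compareTo((Double) x));
>         // -> java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.Double
>     }
> }
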
I think that the root cause of this issue lies in the
RecursiveEvaluator#normalizeOutputType method, which returns a Long if a
BigDecimal value has zero scale:

> } else if(value instanceof BigDecimal){
>   BigDecimal bd = (BigDecimal)value;
>   if(bd.signum() == 0 || bd.scale() <= 0 || bd.stripTrailingZeros().scale() <= 0){
>     try{
>       return bd.longValueExact();
>     } catch(ArithmeticException e){
>       // value was too big for a long, so use a double which can handle scientific notation
>     }
>   }
>   return bd.doubleValue();
> }
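
The effect is easy to reproduce in isolation with the same check (a minimal
standalone sketch of the quoted logic, not the Solr class itself):

> import java.math.BigDecimal;
>
> public class NormalizeSketch {
>     // Same check as in normalizeOutputType: whole numbers collapse to Long,
>     // everything else becomes Double.
>     static Number normalize(BigDecimal bd) {
>         if (bd.signum() == 0 || bd.scale() <= 0 || bd.stripTrailingZeros().scale() <= 0) {
>             try {
>                 return bd.longValueExact();
>             } catch (ArithmeticException e) {
>                 // value was too big for a long, fall through to double
>             }
>         }
>         return bd.doubleValue();
>     }
>
>     public static void main(String[] args) {
>         System.out.println(normalize(new BigDecimal("0.5")).getClass()); // class java.lang.Double
>         System.out.println(normalize(new BigDecimal("1.0")).getClass()); // class java.lang.Long
>         System.out.println(normalize(new BigDecimal("1.5")).getClass()); // class java.lang.Double
>     }
> }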

I consider this to be a major bug because even when your source stream
contains only Float/Double values, applying any arithmetic operation might
produce a value without a decimal part, which will be converted to a Long
and break sorting. Can you confirm that this is a bug, so that I can create
a ticket?

2. The fact that the Streaming Expressions engine heavily relies on the
assumption that a stream contains numeric values of the same type leads to
subtle issues with calculating metrics. Consider this expression:

> rollup(
>     list(
>        tuple(a=val(1.1), g=1),
>        tuple(a=val(2), g=1),
>        tuple(a=val(3.1), g=1)
>     ),
>     over="g",
>     min(a),
>     max(a),
>     sum(a),
>     avg(a)
> )

(I showed earlier how you can get a stream of mixed types.) It returns:

> {
>   "max(a)": 2,
>   "avg(a)": 0.6666666666666666,
>   "min(a)": 2,
>   "sum(a)": 2,
>   "g": "1"
> }

As you can see, the results are wrong for all of the metrics: they
considered only the Long values from the source stream, which in my case
was the value '2'.
This happens because the implementation of every metric holds separate
containers for Long and Double values. For example, MaxMetric#getValue:

> public Number getValue() {
>   if(longMax == Long.MIN_VALUE) {
>     return doubleMax;
>   } else {
>     return longMax;
>   }
> }
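
Here is a minimal standalone sketch of that behaviour (the update() body is
my paraphrase of keeping a separate accumulator per type, not the exact Solr
source); fed the tuples from the rollup above, it returns 2 instead of 3.1:

> public class MaxSketch {
>     long longMax = Long.MIN_VALUE;
>     double doubleMax = -Double.MAX_VALUE;
>
>     // Separate accumulators per numeric type
>     void update(Number value) {
>         if (value instanceof Long) {
>             longMax = Math.max(longMax, value.longValue());
>         } else {
>             doubleMax = Math.max(doubleMax, value.doubleValue());
>         }
>     }
>
>     // Same logic as the quoted MaxMetric#getValue: once any Long was seen,
>     // the Double accumulator is ignored.
>     Number getValue() {
>         if (longMax == Long.MIN_VALUE) {
>             return doubleMax;
>         } else {
>             return longMax;
>         }
>     }
>
>     public static void main(String[] args) {
>         MaxSketch max = new MaxSketch();
>         max.update(1.1);
>         max.update(2L);
>         max.update(3.1);
>         System.out.println(max.getValue()); // prints 2, not 3.1
>     }
> }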

If a stream contained at least one Long among Doubles, the value of the
longMax container would be returned. I consider this a severe design flaw
and would like to get your perspective on it. Should I file a bug, or am I
missing something? Can I expect this to be fixed at some point?

My ENV: solr-impl 7.7.1 5bf96d32f88eb8a2f5e775339885cd6ba84a3b58 - ishan -
2019-02-23 02:39:07

Thank you in advance!
-- 
Best Regards,
Alex Chornyi