Posted to user@orc.apache.org by Dave Birdsall <da...@esgyn.com> on 2016/04/06 20:28:18 UTC

ColumnStatistics getNumberOfValues

Hi,



I have a question about the getNumberOfValues() method of the
ColumnStatistics interface.



In the Hive documentation (for example, here:
https://hive.apache.org/javadocs/r0.12.0/api/org/apache/hadoop/hive/ql/io/orc/ColumnStatistics.html),
the method is described as returning “the number of values in this column”.
Under Method Detail, it says, “it will differ from the number of rows
because of NULL values and repeated values.”



My question concerns “repeated values”.



Being an SQL guy, I leap to the conclusion that getNumberOfValues() returns
the equivalent of “select count(distinct column) from orc_table”, that is,
the number of distinct values for that column in the table. (Well, for ORC
it is for a particular stripe of the table, but I hope my meaning gets
across.)



But when I experiment with this API, it seems to be returning the number of
non-null values instead. For example, using the Trafodion SQL engine to
query an example Hive table using ORC files, I see:



>>select s_rec_end_date from hive.hive.store2_orc order by s_rec_end_date;

S_REC_END_DATE
--------------
    1999-03-13
    1999-03-13
    2000-03-12
    2000-03-12
    2001-03-12
    2001-03-12
?
?
?
?
?
?

--- 12 row(s) selected.



But when I look at what ColumnStatistics.getNumberOfValues() returns for
this column, I get 6. (This particular example table has just one stripe.)
Looking at the values, though, there are just 3 distinct values here.
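
To make the three candidate counts concrete, here is a minimal standalone
sketch in plain Java. It does not call the ORC API at all; the class name
ValueCounts and its methods are made up for illustration, and the data is
just the twelve rows shown above with NULL written as null.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

// Hypothetical helper, not part of ORC or Hive.
class ValueCounts {
    // Every row, NULLs included (SQL: count(*)).
    static long rowCount(List<String> column) {
        return column.size();
    }

    // Non-null entries only (SQL: count(column)).
    static long nonNullCount(List<String> column) {
        return column.stream().filter(Objects::nonNull).count();
    }

    // Distinct non-null entries (SQL: count(distinct column)).
    static long distinctCount(List<String> column) {
        return column.stream().filter(Objects::nonNull).distinct().count();
    }

    // The s_rec_end_date values from the query output above.
    static List<String> sampleColumn() {
        return Arrays.asList(
            "1999-03-13", "1999-03-13",
            "2000-03-12", "2000-03-12",
            "2001-03-12", "2001-03-12",
            null, null, null, null, null, null);
    }

    public static void main(String[] args) {
        List<String> col = sampleColumn();
        System.out.println("rows     = " + rowCount(col));      // 12
        System.out.println("non-null = " + nonNullCount(col));  // 6
        System.out.println("distinct = " + distinctCount(col)); // 3
    }
}
```

Of the three, only the non-null count matches the 6 reported for this
column.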



So, my question is: Is it the case that
ColumnStatistics.getNumberOfValues() returns the number of non-null values
in a column (in a given stripe)? And is the Hive documentation incorrect
when it mentions “repeated values”?



Thanks,



Dave

Re: ColumnStatistics getNumberOfValues

Posted by Owen O'Malley <om...@apache.org>.
It is the number of non-null values. The "and repeated values" is incorrect
and should be fixed.

.. Owen
