Posted to dev@hive.apache.org by "Prasanth J (JIRA)" <ji...@apache.org> on 2014/02/18 01:02:19 UTC

[jira] [Commented] (HIVE-6449) EXPLAIN has diffs in Statistics in tests generated on Windows vs. test generated on Linux

    [ https://issues.apache.org/jira/browse/HIVE-6449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13903610#comment-13903610 ] 

Prasanth J commented on HIVE-6449:
----------------------------------

Hi Remus,

One possible reason for this is that the Parquet SerDe does not implement the SerDeStats interface, or that the Parquet record writers do not implement the StatsProvidingRecordWriter interface. Implementing these interfaces is required for gathering the raw data size. Statistics in EXPLAIN will try to use the raw data size from the metastore. Raw data size should not depend on the operating system, since it is equivalent to deserialized row size * number of rows. So I believe that Parquet does not implement these interfaces and hence does not provide a raw data size, in which case the file size is shown as the "Data size:". If the file size returned by the metastore or by the filesystem.getContentSummary() API call is different, then the statistics reported will be different. My suspicion is that the file sizes for the table differ between Windows and Linux. Can you verify whether the file size on Windows is the same as the file size on Linux?
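
To make the fallback concrete, here is a rough sketch of the behaviour I am describing. It is not Hive's actual code: the function names and the numbers are hypothetical, and it only models the rule "use raw data size if it was gathered, otherwise show the file size".

```python
def expected_raw_data_size(deserialized_row_size, num_rows):
    """Raw data size as described above: deserialized row size * number of rows.

    Hypothetical helper; it depends only on the data, not on the platform.
    """
    return deserialized_row_size * num_rows


def displayed_data_size(raw_data_size, file_size):
    """Value that would be shown as "Data size:" under the assumed fallback.

    raw_data_size is None (or not positive) when the SerDe or record writer
    did not report it; in that case the file size is used instead.
    """
    if raw_data_size is not None and raw_data_size > 0:
        return raw_data_size
    return file_size


if __name__ == "__main__":
    # Illustrative numbers only, not taken from the test output.
    raw = expected_raw_data_size(deserialized_row_size=6, num_rows=100)

    # With raw data size available, differing file sizes do not matter.
    print(displayed_data_size(raw, file_size=1000))
    print(displayed_data_size(raw, file_size=1200))

    # Without raw data size, the reported value follows the file size.
    print(displayed_data_size(None, file_size=1000))
    print(displayed_data_size(None, file_size=1200))
```

If the raw data size is missing, as I suspect it is for Parquet, the last two calls show how two platforms that write files of different sizes would end up with different EXPLAIN output for the same data.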

> EXPLAIN has diffs in Statistics in tests generated on Windows vs. test generated on Linux
> -----------------------------------------------------------------------------------------
>
>                 Key: HIVE-6449
>                 URL: https://issues.apache.org/jira/browse/HIVE-6449
>             Project: Hive
>          Issue Type: Bug
>          Components: Tests
>            Reporter: Remus Rusanu
>            Assignee: Remus Rusanu
>            Priority: Critical
>
> When .q.out files are generated on Windows, the statistics in EXPLAIN differ from the ones generated on Linux. E.g.:
> {code}
> Running: diff -a /root/hive/itests/qtest/../../itests/qtest/target/qfile-results/clientpositive/vectorized_parquet.q.out /root/hive/itests/qtest/../../ql/src/test/results/clientpositive/vectorized_parquet.q.out
> 72c72
> <             Statistics: Num rows: 12288 Data size: 73728 Basic stats: COMPLETE Column stats: NONE
> ---
> >             Statistics: Num rows: 2072 Data size: 257046 Basic stats: COMPLETE Column stats: NONE
> 75c75
> <               Statistics: Num rows: 6144 Data size: 36864 Basic stats: COMPLETE Column stats: NONE
> ---
> >               Statistics: Num rows: 1036 Data size: 128523 Basic stats: COMPLETE Column stats: NONE
> {code}


