You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hive.apache.org by "Mikko Kivistö (JIRA)" <ji...@apache.org> on 2018/09/17 08:34:00 UTC
[jira] [Updated] (HIVE-20574) Column statistics give erraneous
numDistinct
[ https://issues.apache.org/jira/browse/HIVE-20574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mikko Kivistö updated HIVE-20574:
---------------------------------
Description:
1) Download the parquet file to s3/hdfs (e.g. hdfs:///tmp/testi_parquet/) using some tool (aws cli, hdfs command or anything)
- S3: s3://www.smartdatahub.io/data/test.parquet
- HTTP: [http://www.smartdatahub.io/data/test.parquet]
- or the attachmen
eg. with aws cli, wget/curl/distcp can also be used
{\{ aws s3 cp s3://www.smartdatahub.io/data/test.parquet .}}
{{{{hdfs dfs -mkdir hdfs:///tmp/testi_parquet/}}hdfs dfs -put test.parquet }}
{{hdfs:///tmp/testi_parquet/test.parquet}}
2) Create table default.testi_parquet2 on top of that using the schema provided
{\{ CREATE TABLE `default.testi_parquet2`(}}
\{{ `rakennustu` int, }}
\{{ `kohdenimi` string, }}
\{{ `tekstisuun` int, }}
\{{ `tekstikoko` float, }}
\{{ `tekstifont` string, }}
\{{ `buix_bid` int, }}
\{{ `paivitetty` string, }}
\{{ `datanomist` string, }}
\{{ `geom_geojson` string, }}
\{{ `geom` binary, }}
\{{ `extractdate` string)}}
\{{ ROW FORMAT SERDE }}
\{{ 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' }}
\{{ STORED AS INPUTFORMAT }}
\{{ 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' }}
\{{ OUTPUTFORMAT }}
\{{ 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'}}
\{{ LOCATION}}
\{{ 'hdfs:///tmp/testi_parquet/';}}
\{{ – CHANGE THE LOCATION TO THE PREFIX/DIRECTORY YOU DOWNLOADED THE FILE FROM STEP 1 IF IT DIFFERS FROM THE EXAMPLE}}
3) To collect the values showing you the actual reality of the data: Query the distinct count, min and max of column "tekstisuun"
{\{SELECT COUNT(DISTINCT tekstisuun), MAX(tekstisuun), MIN(tekstisuun) FROM default.testi_parquet2; }}
and note them (min 0, max 0, distinct 1)
4) Compute statistics for the table using
{{ANALYZE TABLE default.testi_parquet2 COMPUTE STATISTICS FOR COLUMNS;}}
5) See erroneous statistics entry for numDistincts: Query the statistics by using "
{{DESCRIBE FORMATTED default.testi_parquet2 tekstisuun}}
" and note the ERRANEOUS numDistincts value: 2
was:
1) Download the parquet file to s3/hdfs (e.g. hdfs:///tmp/testi_parquet/) using some tool (aws cli, hdfs command or anything)
- S3: s3://www.smartdatahub.io/data/test.parquet
- HTTP: [http://www.smartdatahub.io/data/test.parquet]
- or the attachmen
eg. with aws cli, wget/curl/distcp can also be used
{{ aws s3 cp s3://www.smartdatahub.io/data/test.parquet .}}
{{hdfs dfs -mkdir hdfs:///tmp/testi_parquet/}}{{hdfs dfs -put test.parquet }}
{{hdfs:///tmp/testi_parquet/test.parquet}}
2) Create table default.testi_parquet2 on top of that using the schema provided
{{ CREATE TABLE `default.testi_parquet2`(}}
{{ `rakennustu` int, }}
{{ `kohdenimi` string, }}
{{ `tekstisuun` int, }}
{{ `tekstikoko` float, }}
{{ `tekstifont` string, }}
{{ `buix_bid` int, }}
{{ `paivitetty` string, }}
{{ `datanomist` string, }}
{{ `geom_geojson` string, }}
{{ `geom` binary, }}
{{ `extractdate` string)}}
{{ ROW FORMAT SERDE }}
{{ 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' }}
{{ STORED AS INPUTFORMAT }}
{{ 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' }}
{{ OUTPUTFORMAT }}
{{ 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'}}
{{ LOCATION}}
{{ 'hdfs:///tmp/testi_parquet/';}}
{{ -- CHANGE THE LOCATION TO THE PREFIX/DIRECTORY YOU DOWNLOADED THE FILE FROM STEP 1 IF IT DIFFERS FROM THE EXAMPLE}}
3) To collect the values showing you the actual reality of the data: Query the distinct count, min and max of column "tekstisuun"
{{SELECT COUNT(DISTINCT tekstisuun), MAX(tekstisuun), MIN(tekstisuun) FROM default.testi_parquet2; }}
and note them (min 0, max 0, distinct 1)
4) Compute statistics for the table using
{{ANALYZE TABLE default.testi_parquet2 COMPUTE STATISTICS FOR COLUMNS;}}
5) See erroneous statistics entry for numDistincts: Query the statistics by using "
{{DESCRIBE FORMATTED default.testi_parquet2 tekstisuun}}
" and note the ERRANEOUS numDistincts value: 2
> Column statistics give erraneous numDistinct
> --------------------------------------------
>
> Key: HIVE-20574
> URL: https://issues.apache.org/jira/browse/HIVE-20574
> Project: Hive
> Issue Type: Bug
> Components: Metastore, Statistics
> Affects Versions: 2.3.2
> Environment: Amazon EMR (BigTop based) from emr-5.9.0 to emr-5.16.0.
> Reporter: Mikko Kivistö
> Priority: Major
> Labels: Statistics, statsCollection
>
> 1) Download the parquet file to s3/hdfs (e.g. hdfs:///tmp/testi_parquet/) using some tool (aws cli, hdfs command or anything)
> - S3: s3://www.smartdatahub.io/data/test.parquet
> - HTTP: [http://www.smartdatahub.io/data/test.parquet]
> - or the attachmen
> eg. with aws cli, wget/curl/distcp can also be used
> {\{ aws s3 cp s3://www.smartdatahub.io/data/test.parquet .}}
> {{{{hdfs dfs -mkdir hdfs:///tmp/testi_parquet/}}hdfs dfs -put test.parquet }}
> {{hdfs:///tmp/testi_parquet/test.parquet}}
> 2) Create table default.testi_parquet2 on top of that using the schema provided
> {\{ CREATE TABLE `default.testi_parquet2`(}}
> \{{ `rakennustu` int, }}
> \{{ `kohdenimi` string, }}
> \{{ `tekstisuun` int, }}
> \{{ `tekstikoko` float, }}
> \{{ `tekstifont` string, }}
> \{{ `buix_bid` int, }}
> \{{ `paivitetty` string, }}
> \{{ `datanomist` string, }}
> \{{ `geom_geojson` string, }}
> \{{ `geom` binary, }}
> \{{ `extractdate` string)}}
> \{{ ROW FORMAT SERDE }}
> \{{ 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' }}
> \{{ STORED AS INPUTFORMAT }}
> \{{ 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' }}
> \{{ OUTPUTFORMAT }}
> \{{ 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'}}
> \{{ LOCATION}}
> \{{ 'hdfs:///tmp/testi_parquet/';}}
> \{{ – CHANGE THE LOCATION TO THE PREFIX/DIRECTORY YOU DOWNLOADED THE FILE FROM STEP 1 IF IT DIFFERS FROM THE EXAMPLE}}
> 3) To collect the values showing you the actual reality of the data: Query the distinct count, min and max of column "tekstisuun"
> {\{SELECT COUNT(DISTINCT tekstisuun), MAX(tekstisuun), MIN(tekstisuun) FROM default.testi_parquet2; }}
> and note them (min 0, max 0, distinct 1)
> 4) Compute statistics for the table using
> {{ANALYZE TABLE default.testi_parquet2 COMPUTE STATISTICS FOR COLUMNS;}}
> 5) See erroneous statistics entry for numDistincts: Query the statistics by using "
> {{DESCRIBE FORMATTED default.testi_parquet2 tekstisuun}}
> " and note the ERRANEOUS numDistincts value: 2
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)