Posted to issues@trafodion.apache.org by "David Wayne Birdsall (JIRA)" <ji...@apache.org> on 2016/09/29 22:55:22 UTC

[jira] [Commented] (TRAFODION-2251) update stats does not work for large Hive varchar columns

    [ https://issues.apache.org/jira/browse/TRAFODION-2251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15534336#comment-15534336 ] 

David Wayne Birdsall commented on TRAFODION-2251:
-------------------------------------------------

This is proving tricky to fix.

The sample table creation fails because we create the sample table with "CREATE TABLE LIKE", and CREATE TABLE LIKE simply copies the datatypes of the source columns to the target. Unfortunately, we support larger char/varchar columns in Hive tables than we do in native Trafodion tables, and the sample table is created as a native Trafodion table.
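
For illustration, the failing internal step amounts to something like the following (a sketch only; the sample table name is the generated one from the error output in the description, and the exact internal DDL may differ):

  CREATE TABLE TRAFODION."_HIVESTATS_".TRAF_SAMPLE_6638414188583073746_1473710297_39124
    LIKE HIVE.MYSCH.MYTABLE;
  -- Fails with ERROR[4247]: the LIKE copies C_STRING1's VARCHAR(10485760),
  -- which exceeds the 200000-byte maximum for a native Trafodion column.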

One choice would be to relax all restrictions on char/varchar column lengths in Trafodion, but that might be a lot of work.

So instead we could do as the description above suggests: prohibit UPDATE STATS on long char/varchar columns, in the same way that we prohibit it for LOB data types. Unfortunately, we still get hung up on the "CREATE TABLE LIKE" step, because we create the sample table with *all* the columns from the source table, supported or not. This works for LOBs because LOBs are supported in native Trafodion tables.

So we are faced with some choices:
1. Replace the ustat-level "CREATE TABLE LIKE" logic with something that picks just the supported columns. Note that the logic to populate the table will need to change accordingly.
2. Change the CREATE TABLE LIKE DDL code to check column length, and if it is an unsupported length, use the maximum supported length instead. Note that this also implies a change to the populate logic, as we'll have to do a SUBSTRING operation to avoid errors when writing rows. (A sketch of both options follows this list.)
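
For example, option 2 might boil down to something like this (a sketch with an illustrative sample table name; 200000 stands in for the current native maximum from error 4247, and the real populate step would also apply the requested sampling):

  CREATE TABLE TRAF_SAMPLE_X
    (C_INT INT, C_STRING1 VARCHAR(200000), C_STRING2 VARCHAR(200000), P_INT INT);
  -- Truncate the oversized columns on populate so the inserts don't overflow:
  INSERT INTO TRAF_SAMPLE_X
    SELECT C_INT,
           SUBSTRING(C_STRING1 FROM 1 FOR 200000),
           SUBSTRING(C_STRING2 FROM 1 FOR 200000),
           P_INT
    FROM HIVE.MYSCH.MYTABLE;
  -- Option 1 would instead create and populate the sample table with only
  -- the columns whose lengths are supported (here, C_INT and P_INT).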

> update stats does not work for large Hive varchar columns
> ---------------------------------------------------------
>
>                 Key: TRAFODION-2251
>                 URL: https://issues.apache.org/jira/browse/TRAFODION-2251
>             Project: Apache Trafodion
>          Issue Type: Bug
>          Components: sql-cmp
>    Affects Versions: 2.1-incubating
>         Environment: All
>            Reporter: David Wayne Birdsall
>            Assignee: David Wayne Birdsall
>
> As shown below, update statistics currently returns various errors once the length of a column exceeds 200000 bytes. But with large varchar support, the total row size of 64MB is theoretically the only limit, and a long varchar column can easily exceed 200000.
> It's not clear whether we intend to support update statistics for large varchars. If we don't plan to support it, perhaps update statistics can at least skip long varchar columns (much like the way it skips LOB columns right now), so that it does not fail when a user runs update statistics on 'every column' for a table that includes long varchars.
> The following sequence of execution also showed a second problem, which may or may not be related. The first update statistics, on 'every column' (which included the long varchar columns), failed, and showstats showed no stats afterwards. A second update statistics was then run on the long varchar columns (c_int to c_string2); it returned no errors, but showstats afterwards still showed no stats.
> $ hive -f hive.sql
> WARNING: Use "yarn jar" to launch YARN applications.
> Logging initialized using configuration in file:/etc/hive/2.4.2.0-258/0/hive-log4j.properties
> OK
> Time taken: 1.455 seconds
> OK
> Time taken: 0.185 seconds
> OK
> Time taken: 0.304 seconds
> OK
> Time taken: 0.371 seconds
> Loading data to table mysch.mytable
> Table mysch.mytable stats: [numFiles=1, totalSize=20971526]
> OK
> Time taken: 0.967 seconds
> $ sqlci -i mytest.sql
> EsgynDB Advanced Conversational Interface 2.2.0
> Copyright (c) 2015-2016 Esgyn Corporation
> >>log mytest.log clear;
> >>cqd HIVE_MAX_STRING_LENGTH '10485760';
> --- SQL operation complete.
> >>set schema hive.mysch;
> --- SQL operation complete.
> >>drop external table if exists mytable for hive.mysch.mytable;
> --- SQL operation complete.
> >>create external table mytable (c_int int, c_string1 varchar(10485760), c_string2 varchar(10485760), p_int int) for hive.mysch.mytable;
> --- SQL operation complete.
> >>
> >>update statistics for table mytable on every column sample random 50 percent;
> *** ERROR[9214] Object TRAFODION."_HIVESTATS_".TRAF_SAMPLE_6638414188583073746_1473710297_39124 could not be created.
> *** ERROR[4247] Specified size in bytes (10485760) exceeds the maximum size allowed (200000) for column C_STRING1.
> *** ERROR[9200] UPDATE STATISTICS for table HIVE.MYSCH.MYTABLE encountered an error (8609) from statement Process_Query.
> *** ERROR[8609] Waited rollback performed without starting a transaction.
> --- SQL operation failed with errors.
> >>showstats for table mytable on c_int to c_string2;
> Histogram data for Table HIVE.MYSCH.MYTABLE
> Table ID: 6638414188583073746
>    Hist ID # Ints    Rowcount         UEC Colname(s)
> ========== ====== =========== =========== ===========================
> No Histograms exist for the requested columns or groups
> --- SQL operation complete.
> >>
> >>update statistics for table mytable on c_int to c_string2 sample random 50 percent;
> --- SQL operation complete.
> >>showstats for table mytable on c_int to c_string2;
> Histogram data for Table HIVE.MYSCH.MYTABLE
> Table ID: 6638414188583073746
>    Hist ID # Ints    Rowcount         UEC Colname(s)
> ========== ====== =========== =========== ===========================
> No Histograms exist for the requested columns or groups
> --- SQL operation complete.
> To reproduce, create 3 files with the following contents and run them in this order:
> chmod 755 ./data_gen.py
> ./data_gen.py data.1row_10MB.txt 1 2 10485760
> hive -f hive.sql
> sqlci -i mytest.sql
> $ cat data_gen.py
> #! /usr/bin/env python
> # Writes '|'-delimited rows of the form <row num>|<varchar col(s)>|<row num>.
> import sys
> if len(sys.argv) != 5 or \
>    sys.argv[1].lower() == '-h' or \
>    sys.argv[1].lower() == '-help':
>     print('Usage: ' + sys.argv[0] + ' <file> <num of rows> <num of varchar columns> <varchar column length>')
>     sys.exit()
> f = open(sys.argv[1], "w+")
> marker = list('ABCDEFGHIJKLMNOPQRSTUVWXYZ')
> for num_rows in range(0, int(sys.argv[2])):
>     f.write(str(num_rows) + '|')
>     # Each varchar column starts with a per-row marker letter...
>     for num_cols in range(0, int(sys.argv[3])):
>         f.write(marker[num_rows % len(marker)])
>         # ...padded out to the requested length with repeating digits.
>         for i in range(1, int(sys.argv[4])):
>             f.write(str(i % 10))
>         f.write('|')
>     f.write(str(num_rows))
>     f.write('\n')
> f.close()
> $ cat hive.sql
> drop database if exists mysch cascade;
> create database mysch;
> use mysch;
> create table mytable(c_int int, c_string1 string, c_string2 string, p_int int) row format delimited fields terminated by '|';
> load data local inpath './data.1row_10MB.txt' overwrite into table mytable;
> $ cat mytest.sql
> log mytest.log clear;
> cqd HIVE_MAX_STRING_LENGTH '10485760';
> set schema hive.mysch;
> drop external table if exists mytable for hive.mysch.mytable;
> create external table mytable (c_int int, c_string1 varchar(10485760), c_string2 varchar(10485760), p_int int) for hive.mysch.mytable;
> update statistics for table mytable on every column sample random 50 percent;
> showstats for table mytable on c_int to c_string2;
> update statistics for table mytable on c_int to c_string2 sample random 50 percent;
> showstats for table mytable on c_int to c_string2;



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)