Posted to issues@impala.apache.org by "Joe McDonnell (JIRA)" <ji...@apache.org> on 2017/10/26 15:46:00 UTC

[jira] [Resolved] (IMPALA-6068) Dataload does not populate functional_*.complextypes_fileformat correctly

     [ https://issues.apache.org/jira/browse/IMPALA-6068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joe McDonnell resolved IMPALA-6068.
-----------------------------------
       Resolution: Fixed
    Fix Version/s: Impala 2.11.0

commit e4f585240ac8f478e25402806f4ea38531b4bf84
Author: Joe McDonnell <jo...@cloudera.com>
Date:   Fri Oct 20 11:41:59 2017 -0700

    IMPALA-6068: Fix dataload for complextypes_fileformat
    
    Dataload typically follows a pattern of loading data into
    a text version of a table, and then using an insert
    overwrite from the text table to populate the table for
    other file formats. This insert is always done in Impala
    for Parquet and Kudu. Otherwise it runs in Hive.
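    A minimal sketch of this pattern (table and column names
    are hypothetical, not from the actual dataload scripts):

```sql
-- 1. Data lands in a text-format base table first.
CREATE TABLE functional.example_table (id INT, col STRING) STORED AS TEXTFILE;
-- ... the text table is populated here ...

-- 2. Each other file format gets its own copy of the schema.
CREATE TABLE functional_parquet.example_table LIKE functional.example_table
  STORED AS PARQUET;

-- 3. The copy runs in Impala for Parquet and Kudu, in Hive otherwise.
INSERT OVERWRITE TABLE functional_parquet.example_table
SELECT * FROM functional.example_table;
```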
    
    Since Impala doesn't support writing nested data, the
    population of complextypes_fileformat tries to hack
    the insert to run in Hive by including it in the ALTER
    part of the table definition. ALTER runs immediately
    after CREATE and always runs in Hive. The problem is
    that ALTER also runs before the base table
    (functional.complextypes_fileformat) is populated.
    The insert succeeds, but it is inserting zero rows.
    
    This code change introduces a way to force the Parquet
    load to run using Hive. This lets complextypes_fileformat
    specify that the insert should happen in Hive and fixes
    the ordering so that the table is populated correctly.
    
    This is also useful for loading custom Parquet files
    into Parquet tables. Hive supports the LOAD DATA LOCAL
    syntax, which can read a file from the local filesystem.
    This means that several locations that currently use
    the hdfs command line can be modified to use this SQL.
    This change speeds up dataload by a few minutes, as it
    avoids the overhead of the hdfs command line.
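    A sketch of the substitution (the file path and table name
    are hypothetical): instead of an `hdfs dfs -put` followed by a
    metadata refresh, Hive can ingest the local file directly.

```sql
-- Hive only: read a file from the local filesystem into the
-- table's storage directory, avoiding a separate hdfs command.
LOAD DATA LOCAL INPATH '/tmp/custom_file.parquet'
OVERWRITE INTO TABLE functional_parquet.example_table;
```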
    
    Any other location that could use LOAD DATA LOCAL is
    also switched over to use it. This includes the
    testescape* tables, which now print the appropriate
    LOAD DATA commands as a result of text_delims_table.py.
    Any location that already uses LOAD DATA LOCAL is also
    switched to indicate that it must run in Hive. Any
    location that was doing an HDFS command in the LOAD
    section is moved to the LOAD_DEPENDENT_HIVE section.
    
    Testing: Ran dataload and core tests. Also verified that
    functional_parquet.complextypes_fileformat has rows.
    
    Change-Id: I7152306b2907198204a6d8d282a0bad561129b82
    Reviewed-on: http://gerrit.cloudera.org:8080/8350
    Reviewed-by: Joe McDonnell <jo...@cloudera.com>
    Tested-by: Impala Public Jenkins


> Dataload does not populate functional_*.complextypes_fileformat correctly
> -------------------------------------------------------------------------
>
>                 Key: IMPALA-6068
>                 URL: https://issues.apache.org/jira/browse/IMPALA-6068
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Infrastructure
>    Affects Versions: Impala 2.10.0
>            Reporter: Joe McDonnell
>            Assignee: Joe McDonnell
>            Priority: Critical
>             Fix For: Impala 2.11.0
>
>
> functional.complextypes_fileformat is a text table containing some nested data.
> Data load is supposed to generate functional.complextypes_fileformat in this order:
> 1. Create table functional.complextypes_fileformat
> 2. Populate functional.complextypes_fileformat using
> INSERT OVERWRITE TABLE {db_name}{db_suffix}.{table_name} SELECT id, named_struct("f1",string_col,"f2",int_col), array(1, 2, 3), map("k", cast(0 as bigint)) FROM functional.alltypestiny;
> 3. Create tables functional_*.complextypes_fileformat
> 4. Populate each table using:
> INSERT OVERWRITE TABLE {table_name} SELECT * FROM functional.{table_name};
> However, dataload currently runs these steps in the wrong order: #1, #3, #4, and only then #2. As a result, #4 operates on zero rows, so every functional_*.complextypes_fileformat table ends up empty. Oddly, dataload also generates step #4 for functional.complextypes_fileformat itself, so the base table is overwritten with rows selected from itself. Dataload should run these steps in the correct order (and avoid this self-overwrite).
> These tables are only used by frontend tests, but the empty tables can cause issues with recent versions of Hive, because Hive appears to skip creating a file when it would write zero rows. That can change the number of files listed in the plan.
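The intended ordering above can be sketched in HiveQL, using the
statements quoted in the description (only the Parquet variant of
step 4 is shown; the other file formats follow the same shape):

```sql
-- Step 1: create the base text table (DDL elided), then
-- Step 2: populate it. This must run in Hive, since Impala
-- cannot write nested types.
INSERT OVERWRITE TABLE functional.complextypes_fileformat
SELECT id,
       named_struct("f1", string_col, "f2", int_col),
       array(1, 2, 3),
       map("k", CAST(0 AS BIGINT))
FROM functional.alltypestiny;

-- Step 3: create the per-format tables (DDL elided), then
-- Step 4: populate each one from the now-nonempty base table.
INSERT OVERWRITE TABLE functional_parquet.complextypes_fileformat
SELECT * FROM functional.complextypes_fileformat;
```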



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)