Posted to issues@impala.apache.org by "Aman Sinha (Jira)" <ji...@apache.org> on 2020/12/28 02:26:00 UTC

[jira] [Resolved] (IMPALA-3244) Auto-update metadata after LOAD DATA moves files out of table directory

     [ https://issues.apache.org/jira/browse/IMPALA-3244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aman Sinha resolved IMPALA-3244.
--------------------------------
    Resolution: Won't Fix

Similar to the prior comment, I would question whether this is a valid use case for the LOAD DATA statement. This command is intended primarily for moving raw data from an HDFS file or directory into an actual table, rather than for moving data from one table to another (for which INSERT ... SELECT is more appropriate). The description in our docs is:
{noformat}
The LOAD DATA statement streamlines the ETL process for an internal Impala table by moving a data file or all the data files in a directory from an HDFS location into the Impala data directory for that table.
{noformat}

This seems reasonably clear: the source data is some arbitrary HDFS location, which may or may not be a table directory. I am marking this as Won't Fix.
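For reference, the matching the issue description asks for (comparing the INPATH argument against every table's LOCATION, whether or not either is qualified with an hdfs:// prefix) could be sketched roughly as below. This is only an illustration of the path comparison, not Impala's actual catalog code; the function names and the table-name-to-LOCATION mapping are hypothetical.

```python
from urllib.parse import urlparse

def normalize_hdfs_path(path):
    """Strip an optional hdfs://host:port prefix and trailing slashes
    so qualified and unqualified forms of the same path compare equal."""
    parsed = urlparse(path)
    p = parsed.path if parsed.scheme else path
    return p.rstrip("/")

def tables_to_refresh(inpath, table_locations):
    """Given the LOAD DATA INPATH argument and a mapping of
    table name -> LOCATION, return the tables whose metadata would
    go stale because files were moved out from under them."""
    src = normalize_hdfs_path(inpath)
    stale = []
    for name, location in table_locations.items():
        loc = normalize_hdfs_path(location)
        # Exact match: INPATH is the root of an unpartitioned table.
        # Prefix match: INPATH is a partition directory under the table
        # location (the "bonus points" case in the description).
        if src == loc or src.startswith(loc + "/"):
            stale.append(name)
    return stale
```

Such a scan would then finish by issuing the equivalent of REFRESH dbname.tablename for each returned table. The cost is a walk over every table's LOCATION on each LOAD DATA, which is part of why this is not an obvious win.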


> Auto-update metadata after LOAD DATA moves files out of table directory
> -----------------------------------------------------------------------
>
>                 Key: IMPALA-3244
>                 URL: https://issues.apache.org/jira/browse/IMPALA-3244
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Catalog, Frontend
>    Affects Versions: Impala 2.3.0
>            Reporter: John Russell
>            Priority: Minor
>
> In an ETL process, you might start with data files in table1 and then use LOAD DATA to move those data files over to table2. Even though the file operation was done by an Impala statement, Impala doesn't recognize that the source table is now empty:
> {code}
> Query: load data inpath '/user/impala/warehouse/weblogs.db/log_ingest_csv_staging' into table log_ingest_csv
> +------------------------------------------------------------+
> | summary                                                    |
> +------------------------------------------------------------+
> | Loaded 3 file(s). Total files in destination location: 178 |
> +------------------------------------------------------------+
> Fetched 1 row(s) in 0.30s
> Query: select count(*) from log_ingest_csv_staging
> WARNINGS: 
> Failed to open HDFS file hdfs://hostname:8020/user/impala/warehouse/blahblah.db/log_ingest_csv_staging/blahblah.csv
> Error(2): No such file or directory
> Query: refresh log_ingest_csv_staging
> Fetched 0 row(s) in 0.04s
> Query: select count(*) from log_ingest_csv_staging
> +----------+
> | count(*) |
> +----------+
> | 0        |
> +----------+
> Fetched 1 row(s) in 0.55s
> {code}
> Is it practical for LOAD DATA to look through the LOCATION attributes of all tables, see if any match the INPATH parameter (whether or not it's qualified with the hdfs: prefix), and if so finish with a REFRESH dbname.tablename operation for the source table?
> Bonus points if this also could be made to work if the LOAD DATA moved files out of a single partition, rather than all the files from the root directory of an unpartitioned table.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)