Posted to issues-all@impala.apache.org by "Tim Armstrong (JIRA)" <ji...@apache.org> on 2018/06/19 17:44:00 UTC

[jira] [Assigned] (IMPALA-733) Improve Parquet error handling for low disk space

     [ https://issues.apache.org/jira/browse/IMPALA-733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Armstrong reassigned IMPALA-733:
------------------------------------

    Assignee:     (was: Henry Robinson)

> Improve Parquet error handling for low disk space
> -------------------------------------------------
>
>                 Key: IMPALA-733
>                 URL: https://issues.apache.org/jira/browse/IMPALA-733
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Backend
>    Affects Versions: Impala 1.2.3
>            Environment: Less than 1 GB free on the filesystem where HDFS resides.
>            Reporter: John Russell
>            Priority: Minor
>
> If HDFS has less than 1 GB free (or, I presume, less than whatever value is set in the PARQUET_FILE_SIZE query option), an INSERT into a Parquet table fails even for tiny amounts of data. That might be unavoidable, but the error should be communicated to the user more clearly.
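> For reference, that threshold can presumably be lowered for a session by changing the query option in impala-shell. The option name comes from this report; the exact SET syntax is a sketch and may differ by version:
> [localhost:21000] > set PARQUET_FILE_SIZE=268435456;
> which would target roughly 256 MB per Parquet data file instead of ~1 GB.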
> INSERT ... VALUES reports that N rows were inserted (no error at all), but the expected data is missing when the table is queried.
> INSERT ... SELECT gives a cryptic error message but still reports that the rows were inserted, even though they weren't.
> Repro:
> About 400 MB free. (This is a VM that keeps getting filled up by Impala-related logs.)
> $ df -k .
> Filesystem           1K-blocks      Used Available Use% Mounted on
> /dev/vda1             24607156  23961976    395184  99% /
> I was going to answer a question on the mailing list by showing an INSERT going from an unpartitioned to a partitioned table.
> [localhost:21000] > create table unpart (year int, s string) stored as parquet;
> Query: create table unpart (year int, s string) stored as parquet
> Returned 0 row(s) in 0.12s
> INSERT ... VALUES looks like it succeeds, but the data isn't really there.
> [localhost:21000] > insert into unpart values (2013,'Happy'),(2014,'New Year');
> Query: insert into unpart values (2013,'Happy'),(2014,'New Year')
> Inserted 2 rows in 0.22s
> [localhost:21000] > select * from unpart;
> Query: select * from unpart
> Returned 0 row(s) in 0.22s
> When copying the data out of a text table, an error is reported, but it doesn't specifically say "out of space". And the "Inserted 2 rows" message raises the hope that the data made it in, but it didn't.
> [localhost:21000] > insert into unpart select * from t1;
> Query: insert into unpart select * from t1
> ERRORS ENCOUNTERED DURING EXECUTION: Backend 0:Failed to close HDFS file: hdfs://127.0.0.1:8020/user/hive/warehouse/partitioning.db/unpart/.impala_insert_staging/284cf98f761aec95_5712ef093b357195//.2903970254304242837-6274340053807624598_1840160694_dir/2903970254304242837-6274340053807624598_1083629803_data.0
> Error(255): Unknown error 255
> Inserted 2 rows in 0.34s
> [localhost:21000] > select * from unpart;
> Query: select * from unpart
> Returned 0 row(s) in 0.22s
> After all this, the data directory contains a leftover staging subdirectory (empty) and a zero-byte data file:
> $ hdfs dfs -ls hdfs://127.0.0.1:8020/user/hive/warehouse/partitioning.db/unpart
> Found 2 items
> drwxrwxrwx   - impala supergroup          0 2014-01-08 11:39 hdfs://127.0.0.1:8020/user/hive/warehouse/partitioning.db/unpart/.impala_insert_staging
> -rw-r--r--   1 impala supergroup          0 2014-01-08 11:39 hdfs://127.0.0.1:8020/user/hive/warehouse/partitioning.db/unpart/3188829493227009611-3605612775229973420_1967882694_data.0
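> Until Impala cleans these up itself, they can be removed by hand (a sketch reusing the paths from the listing above; the -rm -r form assumes a Hadoop 2-era hdfs CLI):
> $ hdfs dfs -rm -r hdfs://127.0.0.1:8020/user/hive/warehouse/partitioning.db/unpart/.impala_insert_staging
> $ hdfs dfs -rm hdfs://127.0.0.1:8020/user/hive/warehouse/partitioning.db/unpart/3188829493227009611-3605612775229973420_1967882694_data.0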
> Suggestions:
> - Make INSERT ... VALUES detect and report the HDFS error encountered while writing the block. Don't report a number of rows inserted.
> - Make the INSERT ... SELECT error clearer: either suggest that it could be an out-of-space condition, or do a follow-up check for PARQUET_FILE_SIZE bytes of free space. Again, don't report a number of rows inserted.
> - Be cleaner about leftover staging directories and empty data files. (Shouldn't the data file stay in the staging directory until it's successfully closed?)
> - Add whatever distributed checking is needed so the error is also handled when a remote node runs out of space, rather than the coordinator node as in this single-VM case. (A quick manual check is sketched below.)
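> As a stopgap for that last point, a user can eyeball free space across all datanodes, not just the coordinator's host, with the dfsadmin report (the grep pattern is an assumption about the report's wording):
> $ hdfs dfsadmin -report | grep 'DFS Remaining'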


