
Posted to issues-all@impala.apache.org by "Csaba Ringhofer (JIRA)" <ji...@apache.org> on 2018/06/15 13:32:00 UTC

[jira] [Created] (IMPALA-7178) Reduce logging for common data errors

Csaba Ringhofer created IMPALA-7178:
---------------------------------------

             Summary: Reduce logging for common data errors
                 Key: IMPALA-7178
                 URL: https://issues.apache.org/jira/browse/IMPALA-7178
             Project: IMPALA
          Issue Type: Improvement
          Components: Backend
            Reporter: Csaba Ringhofer
            Assignee: Csaba Ringhofer


Some data errors (for example out-of-range parquet timestamps) can dominate logs if a table contains a large number of rows with invalid data. If an error has its own error code (see common/thrift/generate_error_codes.py), its occurrences are already aggregated for the user (RuntimeState::LogError()) in every query, but the logs still contain a new line for every occurrence. This is rarely useful, as the log lines repeat the same information (the corrupt data itself is not logged, as it can be sensitive information).

The best approach would be to reduce logging without losing information:
- the first occurrence of an error should be logged (per query/fragment/table/file/column) to help investigation of cases where the data error leads to other errors and to avoid breaking log analyzer tools that search for the current format
- other occurrences can be aggregated, like "in query Q table T column C XY error occurred N times"

An extra goal is to avoid calling RuntimeState::LogError() for occurrences other than the first one, as RuntimeState::LogError() uses a lock.
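
The idea above could be sketched roughly as follows. This is only an illustration under stated assumptions, not Impala code: ErrorKey, LocalErrorCounter, Record() and Summarize() are hypothetical names, and the callback passed to the constructor merely stands in for the existing per-occurrence logging path (for example a call to RuntimeState::LogError()). The sketch assumes each counter is owned by a single thread, so counting needs no lock and the callback runs only for the first occurrence of each key seen by that thread.

```cpp
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <tuple>
#include <utility>
#include <vector>

// Hypothetical aggregation key; the issue suggests something like
// query/fragment/table/file/column. Only a subset is used here.
struct ErrorKey {
  std::string table;
  std::string column;
  int error_code;

  bool operator<(const ErrorKey& other) const {
    return std::tie(table, column, error_code) <
           std::tie(other.table, other.column, other.error_code);
  }
};

// Hypothetical per-thread counter. It is not thread-safe by design:
// one instance per thread means Record() never takes a lock itself.
class LocalErrorCounter {
 public:
  // log_first is invoked once per key, on the first occurrence only.
  // It stands in for the existing (locking) logging call.
  explicit LocalErrorCounter(std::function<void(const ErrorKey&)> log_first)
      : log_first_(std::move(log_first)) {}

  void Record(const ErrorKey& key) {
    int64_t& count = counts_[key];  // value-initialized to 0 on first access
    if (++count == 1) log_first_(key);
  }

  // Builds one aggregated line per key, in the style proposed above.
  std::vector<std::string> Summarize(const std::string& query_id) const {
    std::vector<std::string> lines;
    for (const auto& entry : counts_) {
      const ErrorKey& key = entry.first;
      lines.push_back("in query " + query_id + " table " + key.table +
                      " column " + key.column + " error " +
                      std::to_string(key.error_code) + " occurred " +
                      std::to_string(entry.second) + " times");
    }
    return lines;
  }

 private:
  std::function<void(const ErrorKey&)> log_first_;
  std::map<ErrorKey, int64_t> counts_;
};
```

With this shape, a scanner thread would call Record() for every bad value and write the Summarize() lines once when it finishes. Note that "first occurrence" here means first per thread; making it first per query would need some shared state, which this sketch deliberately leaves out.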




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
