Posted to issues-all@impala.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2020/06/30 05:06:00 UTC

[jira] [Commented] (IMPALA-9515) Milestone 3: Reading “original files”

    [ https://issues.apache.org/jira/browse/IMPALA-9515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17148314#comment-17148314 ] 

ASF subversion and git services commented on IMPALA-9515:
---------------------------------------------------------

Commit 930264afbdc6d309a30e2c7e1eef9fd7129ef29b in impala's branch refs/heads/master from Zoltan Borok-Nagy
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=930264a ]

IMPALA-9515: Full ACID Milestone 3: Read support for "original files"

"Original files" are files that don't have the full ACID schema. We can
see such files if we upgrade a non-ACID table to full ACID. Also, the
LOAD DATA statement can load non-ACID files into full ACID tables. Such
files don't store the special ACID columns, so we need to auto-generate
their values. These columns are operation, originalTransaction, bucket,
rowid, and currentTransaction.

With the exception of 'rowid', all of them can be calculated based on
the file path, so I add their values to the scanner's template tuple.

'rowid' is the ordinal number of the row inside a bucket inside a
directory. For now Impala only allows one file per bucket per
directory. Therefore we can generate row ids for each file
independently.

Multiple files in a single bucket in a directory can only be present if
the table was non-transactional earlier and we upgraded it to a full
ACID table. After the first compaction we should only see one original
file per bucket per directory.
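
The "one file per bucket per directory" restriction can be illustrated with a small sketch. This is not Impala's code: the helper names are hypothetical, and the file naming pattern (a numeric bucket prefix such as "000001_0", optionally followed by "_copy_<n>") is an assumption about how original files are typically named.

```python
import posixpath
import re
from collections import defaultdict

# Assumed naming scheme for original files: "<bucket digits>_<attempt>" with an
# optional "_copy_<n>" suffix, e.g. "000001_0" or "000001_0_copy_1".
_ORIGINAL_FILE_RE = re.compile(r"^(\d+)_\d+(?:_copy_\d+)?$")


def bucket_id_from_filename(path):
    """Parse the bucket id from the leading digits of the file name."""
    name = posixpath.basename(path)
    match = _ORIGINAL_FILE_RE.match(name)
    if match is None:
        raise ValueError("unrecognized original file name: %r" % name)
    return int(match.group(1))


def find_conflicting_buckets(paths):
    """Return {(directory, bucket_id): [paths]} for every bucket that has
    more than one original file in the same directory."""
    groups = defaultdict(list)
    for path in paths:
        key = (posixpath.dirname(path), bucket_id_from_filename(path))
        groups[key].append(path)
    return {key: sorted(files) for key, files in groups.items() if len(files) > 1}
```

A table would be rejected under this sketch whenever find_conflicting_buckets() returns a non-empty result for its file listing.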

In HdfsOrcScanner we calculate the first row id for our split, then
the OrcStructReader fills the rowid slot with the proper values.
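
One way to picture the first-row-id calculation is the sketch below. It is only an illustration under stated assumptions, not the HdfsOrcScanner logic: it assumes a split starts on a stripe boundary and that the per-stripe row counts are available from the ORC file footer. Both function names are hypothetical.

```python
def first_row_id_for_split(stripe_row_counts, first_stripe_in_split):
    """Row id of the first row in a split that starts at the given stripe.

    Assumes row ids are zero-based per file and that the split begins on a
    stripe boundary, so the answer is the row count of all earlier stripes.
    """
    if not 0 <= first_stripe_in_split <= len(stripe_row_counts):
        raise ValueError("stripe index out of range")
    return sum(stripe_row_counts[:first_stripe_in_split])


def row_ids_for_split(stripe_row_counts, first_stripe, num_stripes):
    """Consecutive row ids a reader would fill in for the split's rows."""
    start = first_row_id_for_split(stripe_row_counts, first_stripe)
    count = sum(stripe_row_counts[first_stripe:first_stripe + num_stripes])
    return range(start, start + count)
```

Because each file is handled independently, two scanners working on different splits of the same file can produce non-overlapping row ids without coordinating.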

Testing:
 * added e2e tests to check that the generated values are correct
 * added an e2e test to check that tables with multiple files per bucket
   are rejected
 * added unit tests for the new auxiliary functions

Change-Id: I176497ef9873ed7589bd3dee07d048a42dfad953
Reviewed-on: http://gerrit.cloudera.org:8080/16001
Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>


> Milestone 3: Reading “original files”
> -------------------------------------
>
>                 Key: IMPALA-9515
>                 URL: https://issues.apache.org/jira/browse/IMPALA-9515
>             Project: IMPALA
>          Issue Type: Sub-task
>            Reporter: Zoltán Borók-Nagy
>            Priority: Major
>              Labels: impala-acid
>
> “Original files” don’t store special ACID columns, which means we need to auto-generate those values. Actually, we only need to auto-generate the record id: (originalTransaction, bucket, rowId).
>  * originalTransaction: can be parsed from the containing directory
>  ** If it’s the table root directory then originalTransaction is 0
>  * Bucket: it’s the bit-packed value of (bucket codec version, bucket id, and statement id)
>  ** Bucket codec version is 1
>  ** Bucket id can be parsed from the filename
>  ** Statement id can be parsed from the delta directory:
>  *** delta_<min_writeid>_<max_writeid>_<statement_id>
>  *** (min_writeid = max_writeid for original files)
>  * rowId: zero-based within each bucket. If there are multiple files in a single bucket:
>  ** List all the files belonging to the bucket
>  ** First file’s first row id is 0
>  ** Next file’s first row id is the row count of the first file
>  ** And so on
> The frontend should generate the base record ID for each file and propagate that information to the scanners. That way the scanners would know whether they are scanning files in full ACID format or raw format. The ORC scanner needs to be changed in order to generate and fill the ACID columns for original files.
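
The record id derivation described in the issue can be sketched as follows. This is a rough illustration, not the code from the commit. The bit layout of the bucket value (version in bits 29-31, bucket id in bits 16-27, statement id in bits 0-11) is an assumption modelled on Hive's BucketCodec V1; treating any non-delta directory as the table root with statement id 0 is also an assumption, and all function names are hypothetical.

```python
import re

BUCKET_CODEC_VERSION = 1  # stated in the issue description

# Directory name format from the issue: delta_<min_writeid>_<max_writeid>_<statement_id>
_DELTA_DIR_RE = re.compile(r"^delta_(\d+)_(\d+)_(\d+)$")


def encode_bucket(bucket_id, statement_id=0, version=BUCKET_CODEC_VERSION):
    """Bit-pack (codec version, bucket id, statement id) into one integer.

    Assumed layout: version << 29 | bucket_id << 16 | statement_id.
    """
    if not 0 <= bucket_id < 4096:
        raise ValueError("bucket id does not fit in 12 bits")
    if not 0 <= statement_id < 4096:
        raise ValueError("statement id does not fit in 12 bits")
    return (version << 29) | (bucket_id << 16) | statement_id


def parse_directory(dir_name):
    """Return (original_transaction, statement_id) for a containing directory.

    Anything that is not a delta directory is treated as the table root
    (an assumption of this sketch), which gives originalTransaction 0.
    """
    match = _DELTA_DIR_RE.match(dir_name)
    if match is None:
        return 0, 0
    min_write_id, max_write_id, statement_id = (int(g) for g in match.groups())
    if min_write_id != max_write_id:
        raise ValueError("original files expect min_writeid == max_writeid")
    return min_write_id, statement_id


def first_row_ids(row_counts):
    """First row id of each file in a bucket, given the files' row counts in
    listing order: 0, then the running total of the preceding files."""
    result = []
    next_row_id = 0
    for count in row_counts:
        result.append(next_row_id)
        next_row_id += count
    return result
```

For example, under the assumed layout a file in bucket 1 of the table root gets the bucket value encode_bucket(1), which is 536936448. The order in which files of a bucket are listed is left to the caller here, since the description does not specify it.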



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
