Posted to issues@hive.apache.org by "Sergey Shelukhin (JIRA)" <ji...@apache.org> on 2015/08/12 03:22:46 UTC

[jira] [Comment Edited] (HIVE-11245) LLAP: Fix the LLAP to ORC APIs

    [ https://issues.apache.org/jira/browse/HIVE-11245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14692715#comment-14692715 ] 

Sergey Shelukhin edited comment on HIVE-11245 at 8/12/15 1:22 AM:
------------------------------------------------------------------

Most of the work was done in 3 sub-tasks.
1) Three groups of things were added to the storage API.
* DiskRange; ORC already depends on it, so it was an oversight on master that it was not moved to storage-api. It has been moved on the llap branch.
* EncodedColumnBatch and MemoryBuffer. This is the same as moving VRB and *ColumnVector, but for encoded data.
* DataCache, Pool and Allocator APIs (the only import in any of them is MemoryBuffer, so they are very generic). The right place to implement a format-agnostic cache, allocator, and object pool is Hive, and input formats can use these deep inside their core functionality, where Hive has no insight. Therefore it makes sense to have connective interfaces.
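
To illustrate what "connective interfaces" means here, the following is a minimal sketch, not the actual storage-api code. All names in it (Buffer, BufferAllocator, BufferCache, HeapAllocator, MapCache, readChunk) are hypothetical stand-ins chosen for the example. The engine supplies the cache and allocator implementations, and the format-side code only sees the interfaces.

```java
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;

public class StorageApiSketch {
  /** Hypothetical stand-in for a memory buffer handle. */
  interface Buffer {
    ByteBuffer bytes();
  }

  /** Hypothetical allocator: the engine owns memory, the format asks for it. */
  interface BufferAllocator {
    Buffer allocate(int size);
  }

  /** Hypothetical cache keyed by file and offset; get() returns null on a miss. */
  interface BufferCache {
    Buffer get(String fileKey, long offset);
    void put(String fileKey, long offset, Buffer buffer);
  }

  /** Engine-side implementation: plain heap buffers. */
  static final class HeapAllocator implements BufferAllocator {
    public Buffer allocate(int size) {
      ByteBuffer bb = ByteBuffer.allocate(size);
      return () -> bb;
    }
  }

  /** Engine-side implementation: an unbounded map, with a hit counter. */
  static final class MapCache implements BufferCache {
    private final Map<String, Buffer> map = new HashMap<>();
    int hits = 0;

    public Buffer get(String fileKey, long offset) {
      Buffer b = map.get(fileKey + "@" + offset);
      if (b != null) {
        hits++;
      }
      return b;
    }

    public void put(String fileKey, long offset, Buffer buffer) {
      map.put(fileKey + "@" + offset, buffer);
    }
  }

  /**
   * Format-side code: it knows how to locate a chunk, but depends only on
   * the interfaces above. diskBytes stands in for bytes read from a file.
   */
  static Buffer readChunk(String fileKey, long offset, byte[] diskBytes,
                          BufferCache cache, BufferAllocator alloc) {
    Buffer cached = cache.get(fileKey, offset);
    if (cached != null) {
      return cached;
    }
    Buffer fresh = alloc.allocate(diskBytes.length);
    fresh.bytes().put(diskBytes).flip();
    cache.put(fileKey, offset, fresh);
    return fresh;
  }
}
```

With this split, the reader never needs to know how the cache evicts or where the allocator gets its memory, which is the point of keeping the interfaces generic.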

2) The ....orc.encoded package was created with a fully separate path for the "record reader", as discussed, although I don't think it was necessary. That required making some things in RecordReaderUtils, etc. public because the Java visibility model is stupid.
It contains 9 files, most of which are very small.
* EncodedOrcFile - equivalent to OrcFile, static factory for Reader.
* Reader - interface, equivalent to orc.Reader, produces EncodedReader.
* EncodedReader - interface, equivalent to RecordReader (although not in signatures), for reading encoded data.
* Consumer - interface used in EncodedReader call to return data asynchronously (logically, a queue for returned data with "done" and "error" markers).
* OrcBatchKey, OrcCacheKey - simple data structures used as keys when passing data and for the cache.
* ReaderImpl - equivalent to orc.ReaderImpl, the Reader interface implementation.
* EncodedReaderImpl - equivalent to RecordReaderImpl (although not in signatures), the main class that contains the code. It is package-private, so it's not even visible outside the package.
* CacheChunk - part of EncodedReaderImpl that has to be visible for tests, so it's in a separate file.
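
The "queue with done and error markers" shape of Consumer can be sketched as below. This is an illustration under assumptions, not the code in the package: the interface name DataConsumer, its method names, and the readEncoded helper are all hypothetical. The producer hands over batches one at a time and then sends exactly one terminal marker.

```java
import java.util.ArrayList;
import java.util.List;

public class ConsumerSketch {
  /** Hypothetical callback interface; names are illustrative only. */
  interface DataConsumer<T> {
    void consumeData(T data);   // one returned batch
    void setDone();             // terminal marker: no more data
    void setError(Throwable t); // terminal marker: the read failed
  }

  /** Collects everything it is given; synchronized so a reader thread can call it. */
  static final class CollectingConsumer<T> implements DataConsumer<T> {
    final List<T> received = new ArrayList<>();
    boolean done;
    Throwable error;

    public synchronized void consumeData(T data) {
      received.add(data);
    }

    public synchronized void setDone() {
      done = true;
    }

    public synchronized void setError(Throwable t) {
      error = t;
    }
  }

  /**
   * Stand-in for an encoded reader call. A null batch simulates a failure.
   * A real reader would typically run this on its own thread.
   */
  static void readEncoded(List<String> batches, DataConsumer<String> consumer) {
    try {
      for (String batch : batches) {
        if (batch == null) {
          throw new IllegalStateException("corrupt batch");
        }
        consumer.consumeData(batch);
      }
      consumer.setDone();
    } catch (RuntimeException e) {
      consumer.setError(e);
    }
  }
}
```

The caller never blocks on a return value; it only reacts to the callbacks, so the same consumer works whether the data arrives from cache or from disk.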

3) The remaining item is moving the TreeReader bits that depend on the orc.encoded package into the encoded package. Either I or [~prasanth_j] can do this.


> LLAP: Fix the LLAP to ORC APIs
> ------------------------------
>
>                 Key: HIVE-11245
>                 URL: https://issues.apache.org/jira/browse/HIVE-11245
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: Owen O'Malley
>            Assignee: Sergey Shelukhin
>            Priority: Blocker
>
> Currently the LLAP branch has refactored the ORC code to have different code paths depending on whether the data is coming from the cache or a FileSystem.
> We need to introduce a concept of a DataSource that is responsible for getting the necessary bytes regardless of whether they are coming from a FileSystem, an in-memory cache, or both.
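
The issue description does not specify what a DataSource would look like. One possible reading is sketched below, assuming a single read(offset, length) method; the interface shape and the InMemoryFile and CachedSource classes are hypothetical and exist only for this example. The reading code calls the same method whether the bytes come from the file, the cache, or a mix of both.

```java
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;

public class DataSourceSketch {
  /** Hypothetical interface: returns the requested byte range, wherever it lives. */
  interface DataSource {
    ByteBuffer read(long offset, int length);
  }

  /** Stands in for a FileSystem-backed source; counts how often it is hit. */
  static final class InMemoryFile implements DataSource {
    private final byte[] contents;
    int reads;

    InMemoryFile(byte[] contents) {
      this.contents = contents;
    }

    public ByteBuffer read(long offset, int length) {
      reads++;
      return ByteBuffer.wrap(contents, (int) offset, length).slice();
    }
  }

  /** Serves repeated ranges from memory and falls back to the backing source. */
  static final class CachedSource implements DataSource {
    private final DataSource backing;
    private final Map<String, ByteBuffer> cache = new HashMap<>();

    CachedSource(DataSource backing) {
      this.backing = backing;
    }

    public ByteBuffer read(long offset, int length) {
      ByteBuffer hit = cache.computeIfAbsent(
          offset + ":" + length, k -> backing.read(offset, length));
      // duplicate() gives each caller an independent position over shared bytes.
      return hit.duplicate();
    }
  }

  /** Reader-side code: a single code path for any DataSource. */
  static int sum(DataSource source, long offset, int length) {
    ByteBuffer buf = source.read(offset, length);
    int total = 0;
    while (buf.hasRemaining()) {
      total += buf.get();
    }
    return total;
  }
}
```

Here sum() has no branch for "cached" versus "not cached", which is the property the description asks for.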



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)