Posted to issues@hive.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2020/06/19 12:29:00 UTC

[jira] [Work logged] (HIVE-23729) LLAP text cache fails when using multiple tables/schemas on the same files

     [ https://issues.apache.org/jira/browse/HIVE-23729?focusedWorklogId=448435&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-448435 ]

ASF GitHub Bot logged work on HIVE-23729:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 19/Jun/20 12:28
            Start Date: 19/Jun/20 12:28
    Worklog Time Spent: 10m 
      Work Description: szlta opened a new pull request #1150:
URL: https://github.com/apache/hive/pull/1150


   ## NOTICE
   
   Please create an issue in ASF JIRA before opening a pull request,
   and you need to set the title of the pull request which starts with
   the corresponding JIRA issue number. (e.g. HIVE-XXXXX: Fix a typo in YYY)
   For more details, please see https://cwiki.apache.org/confluence/display/Hive/HowToContribute
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Issue Time Tracking
-------------------

            Worklog Id:     (was: 448435)
    Remaining Estimate: 0h
            Time Spent: 10m

> LLAP text cache fails when using multiple tables/schemas on the same files
> --------------------------------------------------------------------------
>
>                 Key: HIVE-23729
>                 URL: https://issues.apache.org/jira/browse/HIVE-23729
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Ádám Szita
>            Assignee: Ádám Szita
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> When using the text-based cache, we hit exceptions in the following case:
>  * Table A with 3 columns is defined on location X (where we have text-based data files)
>  * Table B with 2 columns is defined on the same location X
>  * User runs a query on table A, thereby filling the LLAP cache.
>  * If the next query goes against table B, which has a different schema, LLAP throws an error:
> {code:java}
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 2
>  at org.apache.hadoop.hive.llap.cache.SerDeLowLevelCacheImpl.getCacheDataForOneSlice(SerDeLowLevelCacheImpl.java:411)
>  at org.apache.hadoop.hive.llap.cache.SerDeLowLevelCacheImpl.getFileData(SerDeLowLevelCacheImpl.java:389)
>  at org.apache.hadoop.hive.llap.io.encoded.SerDeEncodedDataReader.readFileWithCache(SerDeEncodedDataReader.java:819)
>  at org.apache.hadoop.hive.llap.io.encoded.SerDeEncodedDataReader.performDataRead(SerDeEncodedDataReader.java:720)
>  at org.apache.hadoop.hive.llap.io.encoded.SerDeEncodedDataReader$5.run(SerDeEncodedDataReader.java:274)
>  at org.apache.hadoop.hive.llap.io.encoded.SerDeEncodedDataReader$5.run(SerDeEncodedDataReader.java:271) {code}
> This is because the cache lookup is based on file ID, which in this case is the same for both tables. However, unlike with ORC files, the cached content and the file content are different, because the cached content depends on the schema that was defined by the user: the original text content is encoded into ORC in the cache.
> I think for the text cache case we will need to extend the cache key from being just the simple file ID to something that tracks the schema too. This will result in caching the *same file content* multiple times (if there are multiple schemas like this); however, as we can see, the *cached content itself could be quite different* (e.g. different streams with different encodings), and in turn we gain correctness.
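
The proposal above (extending the cache key from the file ID alone to something that also tracks the schema) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the change made in the pull request: the class `SchemaAwareCacheSketch`, the `CacheKey` type, and the use of a list of column type names to stand for the schema are all hypothetical, and a plain `HashMap` stands in for the LLAP cache.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Illustrative sketch only; none of these names come from the Hive code base.
class SchemaAwareCacheSketch {

    /** Hypothetical composite key: the file ID plus the column types of the table schema. */
    static final class CacheKey {
        private final long fileId;
        private final List<String> columnTypes;

        CacheKey(long fileId, List<String> columnTypes) {
            this.fileId = fileId;
            // Defensive copy so the key cannot change after it is stored in the map.
            this.columnTypes = Collections.unmodifiableList(new ArrayList<>(columnTypes));
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) {
                return true;
            }
            if (!(o instanceof CacheKey)) {
                return false;
            }
            CacheKey other = (CacheKey) o;
            return fileId == other.fileId && columnTypes.equals(other.columnTypes);
        }

        @Override
        public int hashCode() {
            return Objects.hash(fileId, columnTypes);
        }
    }

    // Stand-in for the cache: key -> one encoded stream per column.
    private final Map<CacheKey, String[]> cache = new HashMap<>();

    void put(long fileId, List<String> columnTypes, String[] encodedColumns) {
        cache.put(new CacheKey(fileId, columnTypes), encodedColumns);
    }

    /** Returns the cached columns for this file and schema, or null on a miss. */
    String[] get(long fileId, List<String> columnTypes) {
        return cache.get(new CacheKey(fileId, columnTypes));
    }

    int size() {
        return cache.size();
    }

    public static void main(String[] args) {
        SchemaAwareCacheSketch cache = new SchemaAwareCacheSketch();
        long fileId = 42L; // the same underlying file at location X
        List<String> tableA = Arrays.asList("int", "string", "double");
        List<String> tableB = Arrays.asList("int", "string");

        // Query on table A fills the cache with three encoded columns.
        cache.put(fileId, tableA, new String[] {"c0", "c1", "c2"});

        // A lookup for table B misses instead of receiving table A's entry.
        System.out.println("table B hit: " + (cache.get(fileId, tableB) != null));
        System.out.println("table A columns: " + cache.get(fileId, tableA).length);
    }
}
```

With a key like this, a query on table B simply misses and re-encodes the file under its own schema, at the cost of holding two entries for the same file, which is the trade-off described above.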



--
This message was sent by Atlassian Jira
(v8.3.4#803005)