Posted to issues@hive.apache.org by "Jesus Camacho Rodriguez (JIRA)" <ji...@apache.org> on 2016/09/08 09:11:21 UTC

[jira] [Resolved] (HIVE-14468) Implement Druid query based input format

     [ https://issues.apache.org/jira/browse/HIVE-14468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jesus Camacho Rodriguez resolved HIVE-14468.
--------------------------------------------
       Resolution: Fixed
    Fix Version/s: 2.2.0

Pushed in HIVE-14217.

> Implement Druid query based input format
> ----------------------------------------
>
>                 Key: HIVE-14468
>                 URL: https://issues.apache.org/jira/browse/HIVE-14468
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Druid integration
>    Affects Versions: 2.2.0
>            Reporter: Jesus Camacho Rodriguez
>            Assignee: Jesus Camacho Rodriguez
>             Fix For: 2.2.0
>
>
> This input format is responsible for generating the splits and creating the record readers.
> * For *Timeseries*, *TopN*, and *GroupBy* queries, we create a single split containing the broker address and the query. The record reader then submits the query to the broker, retrieves the results, parses them, and generates records.
> * For *Select* queries, Druid supports a threshold (limit), which can be used to retrieve the query results over multiple requests. Hence, we first issue a Druid Segment Metadata query to obtain the number of rows in the datasource. Then we create _number of rows / default\_threshold_ splits, where _default\_threshold_ is a Hive configuration property defined as {{hive.druid.select.threshold}}. Each generated split contains the broker address and a Select JSON query with a _start_ and _end_ date range (currently we assume a uniform distribution of records across the time dimension). The splits are handled independently by the record readers, which submit their queries to the broker, retrieve the results, parse them, and generate records. This way, result retrieval for these queries can be parallelized.
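The split computation described for Select queries can be sketched as follows. This is an illustrative sketch, not Hive's actual implementation: the class and method names (`DruidSelectSplitter`, `computeSplits`, `Interval`) are hypothetical, and it assumes the uniform-distribution-over-time simplification stated above.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the Select-query split computation: divide the
// datasource's time range into ceil(totalRows / threshold) equal intervals,
// one per split, assuming rows are uniformly distributed over time.
public class DruidSelectSplitter {

    // One split's [start, end) time interval; rendered Druid-style as "start/end".
    public static final class Interval {
        final Instant start;
        final Instant end;
        Interval(Instant start, Instant end) { this.start = start; this.end = end; }
        @Override public String toString() { return start + "/" + end; }
    }

    // totalRows would come from a Segment Metadata query; threshold from
    // the hive.druid.select.threshold configuration property.
    public static List<Interval> computeSplits(Instant start, Instant end,
                                               long totalRows, long threshold) {
        // Ceiling division, with at least one split.
        int numSplits = (int) Math.max(1, (totalRows + threshold - 1) / threshold);
        long startMillis = start.toEpochMilli();
        long totalMillis = end.toEpochMilli() - startMillis;
        List<Interval> splits = new ArrayList<>(numSplits);
        for (int i = 0; i < numSplits; i++) {
            // Integer arithmetic partitions the range without gaps or overlaps.
            Instant s = Instant.ofEpochMilli(startMillis + totalMillis * i / numSplits);
            Instant e = Instant.ofEpochMilli(startMillis + totalMillis * (i + 1) / numSplits);
            splits.add(new Interval(s, e));
        }
        return splits;
    }

    public static void main(String[] args) {
        // 1,000,000 rows with a threshold of 250,000 over 4 days -> 4 one-day splits.
        List<Interval> splits = computeSplits(
            Instant.parse("2016-01-01T00:00:00Z"),
            Instant.parse("2016-01-05T00:00:00Z"),
            1_000_000, 250_000);
        System.out.println(splits.size());
        System.out.println(splits.get(0));
    }
}
```

Each resulting interval would then be embedded as the date range of an independent Select JSON query, so the record readers can fetch disjoint slices of the result in parallel.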



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)