Posted to dev@hive.apache.org by "Jesus Camacho Rodriguez (JIRA)" <ji...@apache.org> on 2016/08/08 11:50:20 UTC

[jira] [Created] (HIVE-14468) Implement Druid query based input format

Jesus Camacho Rodriguez created HIVE-14468:
----------------------------------------------

             Summary: Implement Druid query based input format
                 Key: HIVE-14468
                 URL: https://issues.apache.org/jira/browse/HIVE-14468
             Project: Hive
          Issue Type: Sub-task
          Components: Druid integration
    Affects Versions: 2.2.0
            Reporter: Jesus Camacho Rodriguez
            Assignee: Jesus Camacho Rodriguez


The input format is responsible for generating the splits and creating the record readers.

* For *Timeseries*, *TopN*, and *GroupBy* queries, we create a single split containing the broker address and the query. The record reader then submits the query to the broker, retrieves the results, and parses them to generate records.

* For *Select* queries, Druid has the concept of a threshold (limit), which is used to retrieve the query results across multiple requests. Hence, we first issue a Druid Segment Metadata query to obtain the number of rows in the datasource. Then we create _number of rows / default\_threshold_ splits, where _default\_threshold_ is a Hive configuration property defined as {{hive.druid.select.threshold}}. Each generated split contains the broker address and a Select JSON query with a _start_ and _end_ row. The record readers handle the splits independently: each one submits its query to the broker, retrieves the results, and parses them to generate records. This way we can parallelize the retrieval of results for these queries.
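
The split computation for Select queries could look roughly like the sketch below. It is only an illustration of the arithmetic, not the actual Hive code: the class {{SelectSplitSketch}}, its {{Split}} type, and the method {{computeSelectSplits}} are hypothetical names. It also assumes that the division is rounded up so a trailing partial range still gets a split, that row ranges are half-open (_start_ inclusive, _end_ exclusive), and that an empty datasource yields no splits. The row count is taken as a parameter here; in the proposal it would come from the Segment Metadata query.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of Select split generation; not the Hive implementation. */
class SelectSplitSketch {

    /** One split: the broker address plus the row range its Select query covers. */
    static final class Split {
        final String brokerAddress;
        final long startRow; // inclusive (assumption)
        final long endRow;   // exclusive (assumption)

        Split(String brokerAddress, long startRow, long endRow) {
            this.brokerAddress = brokerAddress;
            this.startRow = startRow;
            this.endRow = endRow;
        }
    }

    /**
     * Creates one split per block of `threshold` rows, i.e. ceil(numRows / threshold) splits.
     * `threshold` stands in for the value of hive.druid.select.threshold.
     */
    static List<Split> computeSelectSplits(String brokerAddress, long numRows, int threshold) {
        if (threshold <= 0) {
            throw new IllegalArgumentException("threshold must be positive");
        }
        if (numRows < 0) {
            throw new IllegalArgumentException("numRows must not be negative");
        }
        List<Split> splits = new ArrayList<>();
        for (long start = 0; start < numRows; start += threshold) {
            // The last split may cover fewer than `threshold` rows.
            long end = Math.min(start + threshold, numRows);
            splits.add(new Split(brokerAddress, start, end));
        }
        return splits;
    }
}
```

For example, with 25 rows and a threshold of 10, this produces three splits covering rows \[0, 10), \[10, 20), and \[20, 25). For Timeseries, TopN, and GroupBy queries no such computation is needed, since a single split carries the whole query.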



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)