You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@druid.apache.org by GitBox <gi...@apache.org> on 2018/07/22 21:00:12 UTC

[GitHub] datametrics opened a new issue #6035: quantile aggregation (99%) in druid from hive

datametrics opened a new issue #6035: quantile aggregation (99%) in druid from hive
URL: https://github.com/apache/incubator-druid/issues/6035
 
 
   Hi all together,
   
   i've got a question concerning data ingestion from inside hive.
   I use a query similar to this one for indexing a result view in Druid.
   
   use LONGFIELD_DWH;
   DROP TABLE IF EXISTS TestCube;
   EXPLAIN CREATE TABLE TestCube
   STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
   TBLPROPERTIES (
   "druid.segment.granularity" = "MONTH",
   "druid.query.granularity" = "DAY")
   AS
   SELECT
   cast(DimDate.Name as timestamp with local time zone) as __time,
   cast(DimInstrument.Name as string) InsName,
   FactTable.TestMeasure * p.Quantity as TestMeasure,
   cast(DimInstrument.InstrumentType as string) InstrumentType,
   cast(FactTable.IdScenario as string) Scenario,
   cast(p.Portfolio as string) Portfolio,
   p.Quantity as Quantity
   FROM FactTable
   JOIN DimDate on (FactTable.IdDate = DimDate.Id)
   JOIN DimInstrument on (FactTable.IdInstrument = DimInstrument.Id)
   JOIN DimScenario on (FactTable.IdScenario = DimScenario.Id)
   JOIN (select pos.Name, pos.IdInstrument, pos.IdPortfolio,pf.Name as Portfolio,
   pos.Quantity from FactPosition pos
   join DimInstrument ins on ins.Id = pos.IdInstrument
   join DimPortfolio pf on pf.Id = pos.IdPortfolio) p
   on DimInstrument.Id = p.IdInstrument;
   
   That works without any issues. The datasource gets published in druid and is visible
   in superset after querying datasource metadata. A segment is created for every day
   in the facttable.
   
   As part of my analysis i would like to build a postaggregated quantile measure like
   described on the project documentation page:
   
   At first i create a new metric in superset "HISTOGRAM" of type "approxHistogramFold" with this JSON:
   
   {
   "type" : "approxHistogramFold",
   "name" : "HISTOGRAM",
   "fieldName" : "sum__testmeasure",
   "resolution" : 500,
   "numBuckets" : 500,
   "lowerLimit" : 0.0
   }
   
   Next, i create another metric in superset "Response_99" (type = "postagg")that should display the
   fifth-worst (i.e. the 99% quantile) item from the list of 500 ordered testmeasure items
   from the approxhistogram measure (JSON):
   
   {
   "type": "quantile",
   "name": "HISTOGRAM",
   "fieldName": "HISTOGRAM",
   "probability": "0.99"
   }
   
   That measure combination runs but the output is "na". As far as I understand the documentation the field
   of the first aggregator for "testmeasure" has to be ingested with a specific metric-spec / ingestion
   aggregator similar to this:
   
   http://druid.io/docs/latest/development/extensions-core/datasketches-aggregators.html
   
   Does anyone know how to incorporate this into the ingestion query issued in HIVE?
   If not, does anyone see how to mimic such quantile measures maybe without "approxHistogramFold" and
   probably in a simple javascript postaggregation?
   
   Thanks in advance!

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@druid.apache.org
For additional commands, e-mail: dev-help@druid.apache.org