Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2019/05/08 22:04:08 UTC

[GitHub] [incubator-druid] nosahama commented on issue #2523: Support multiple lookups within one namespace

URL: https://github.com/apache/incubator-druid/issues/2523#issuecomment-490667638
 
 
   > As requested I am sharing our use case. We're using a TSV in S3 for a namespace lookup (at least to start with, we will probably switch over to a JDBC source eventually). We have a single key column, which always corresponds to the same actual dimension in Druid. We have a dozen lookup columns (could grow by a handful, but I'd think no more than 20). And we're starting pretty small now with only about 100K rows, but expect that could grow to several million rows before too long.
   > 
   > We don't need this updated really frequently. Actually we're still working out our ETLs and so forth to deal with revisions and additions to the lookup data. But I wouldn't expect us to have updates more frequently than hourly, and probably more like daily.
   > 
   > As far as pain points with this arrangement - there is sure plenty of boilerplate in the config. I have an array of a dozen entries in `druid.query.extraction.namespace.lookups` that are identical in all fields except for `namespace` and `valueColumn`. A bit clunky but not so much that I'd complain about it really - I did write a couple of simple scripts that generate the stuff to be placed in config.
   > 
   > I'm more concerned about the overhead when we do update the lookup source. Druid will have to load and parse this potentially large 20 x 3M TSV once per lookup. I haven't done any benchmarking, but I have noticed that it can take on the order of 15 seconds to completely load our current 12 x 100K case. Even if it takes a few minutes, that is not a gamebreaker (assuming it does not interfere with query performance or produce inconsistent results while in progress). But it certainly seems like it could be a lot more efficient to load and parse the file once instead of 12 or 20 times.
   > 
   > Overall, the configuration and use feels a bit clunky, I think because from the user point of view, we have just one "lookup namespace" - there is a single source, and a single key column. It would feel more natural to define the data source level properties (uri, format, columns) and key column once, along with a list of allowed targetColumns, then use it in dimension specs and filters by referencing just the one single namespace plus a targetColumn. It might start to look like an ingestion spec at that point, with dataSchema- and ioConfig-like sections.
   > 
   > But honestly I don't know how much of a priority I'd want it to be. Associating a single namespace with a single key column and multiple value columns might well be overfitting to our specific case, and it's certainly quite usable as it stands.
   > 
   > (One side note, the ability to include columns in the CSV which are not key or value columns is useful for assembling the data manually - we can include "friendly name" sort of columns that are helpful to people who are filling in or auditing the actual lookup data.)
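   
   To illustrate the duplication described above, the per-column entries in `druid.query.extraction.namespace.lookups` might look roughly like the following sketch. This is an assumption based on the URI extraction-namespace config of that Druid era; the bucket, column names, and poll period are hypothetical, and only `namespace` and `valueColumn` differ between entries:
   
   ```
   druid.query.extraction.namespace.lookups=[
     {
       "type": "uri",
       "namespace": "country_name",
       "uri": "s3://my-bucket/lookups/data.tsv",
       "namespaceParseSpec": {
         "format": "tsv",
         "columns": ["id", "country_name", "region"],
         "keyColumn": "id",
         "valueColumn": "country_name"
       },
       "pollPeriod": "PT1H"
     },
     {
       "type": "uri",
       "namespace": "region",
       "uri": "s3://my-bucket/lookups/data.tsv",
       "namespaceParseSpec": {
         "format": "tsv",
         "columns": ["id", "country_name", "region"],
         "keyColumn": "id",
         "valueColumn": "region"
       },
       "pollPeriod": "PT1H"
     }
   ]
   ```
   
   With a dozen value columns, the same source file is declared (and re-fetched) a dozen times, which is the inefficiency the comment is pointing at.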
   
   Hi there, I am trying to configure Druid to load a lookup file from S3. How do I do this? Do I use the `file:/` syntax, or is there another syntax for loading lookups from S3?
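   
   (Editor's note: a URI-based cached lookup can generally reference S3 directly with an `s3://` URI rather than `file:/`, provided the relevant extensions, e.g. `druid-lookups-cached-global` and `druid-s3-extensions`, are loaded. A minimal sketch, with a hypothetical bucket, path, and column names:
   
   ```
   {
     "type": "cachedNamespace",
     "extractionNamespace": {
       "type": "uri",
       "uri": "s3://my-bucket/lookups/my-lookup.tsv",
       "namespaceParseSpec": {
         "format": "tsv",
         "columns": ["key", "value"],
         "keyColumn": "key",
         "valueColumn": "value"
       },
       "pollPeriod": "PT1H"
     }
   }
   ```
   
   Consult the lookups documentation for the exact field names in your Druid version.)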

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org