Posted to dev@pinot.apache.org by Pinot Slack Email Digest <sn...@apache.org> on 2021/01/06 02:00:12 UTC

Apache Pinot Daily Email Digest (2021-01-05)

### _#general_

  
 **@wrbriggs:** Sorry to be a never-ending fount of questions, folks… is it
expected / necessary to create a rangeIndex on dateTime fields, or are those
automatically indexed efficiently? Likewise, should I add dateTime fields to
the noDictionaryColumns list?  
**@mayanks:** What's your time granularity?  
**@mayanks:** Typically, we don't need to set explicit indexing for dateTime
fields, as we can still prune segments based on metadata.  
**@wrbriggs:** My base timestamp is epoch milliseconds, I am playing around
with deriving an hourly grain field for pre-aggregating in a star-tree index,
but based on my current prototype, that seems like it might be premature
optimization  
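
For readers following along, here is a minimal sketch of the two ideas in this exchange: pruning segments by their time metadata, and deriving an hourly grain from epoch milliseconds. `SegmentMeta`, `prune_by_time`, and `to_epoch_hours` are hypothetical names used for illustration only; this is not Pinot's implementation.

```python
from dataclasses import dataclass

MILLIS_PER_HOUR = 3_600_000


def to_epoch_hours(epoch_millis: int) -> int:
    """Truncate an epoch-millisecond timestamp to whole hours since epoch."""
    return epoch_millis // MILLIS_PER_HOUR


@dataclass
class SegmentMeta:
    # Hypothetical stand-in for per-segment metadata; not a Pinot class.
    name: str
    min_time_ms: int
    max_time_ms: int


def prune_by_time(segments, query_start_ms, query_end_ms):
    """Keep only segments whose [min, max] time range overlaps the query range."""
    return [
        s for s in segments
        if s.max_time_ms >= query_start_ms and s.min_time_ms <= query_end_ms
    ]
```

Because each segment records the minimum and maximum time it contains, a time-range filter can skip whole segments without any per-row index on the timestamp column.
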
**@mayanks:** A general recommendation is to sort on the primary key (or a
dimension that appears in most queries), and add the minimum number of
inverted indexes needed to get reasonable selectivity across your query set.  
**@wrbriggs:** :thumbsup:  
**@wrbriggs:** Thank you. I am also looking to partition the incoming data
based on a dimension that is almost always used selectively in the WHERE
clause, and use broker-side partition pruning to minimize scanning unnecessary
segments - I’m not sure if that will actually help, or if forcing data
locality like that will bottleneck things.  
**@wrbriggs:** I am using that same dimension as my sort key, but right now,
it’s not particularly useful to sort on it, because it shows up in all
segments  
**@mayanks:** Oh yeah, partitioning is definitely a good idea.  
**@mayanks:** In our use cases, we typically partition and sort on the same
dimension  
**@wrbriggs:** Ok, that makes me feel better, as that was my plan.  
**@mayanks:** Good plan, I'd say.  
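
The partition-pruning idea discussed above can be sketched roughly as follows. This assumes each segment holds rows from exactly one partition and that the query filters on the partition column with an equality predicate. `partition_of` uses a placeholder hash and `prune_by_partition` is a hypothetical helper; neither reflects Pinot's actual partition functions or broker code.

```python
def partition_of(value: str, num_partitions: int) -> int:
    """Map a dimension value to a partition id.

    Placeholder hash for illustration only. A real deployment must use the
    same partition function as the upstream producer, or pruning will drop
    segments that actually contain matching rows.
    """
    return sum(value.encode("utf-8")) % num_partitions


def prune_by_partition(segment_partitions, filter_value, num_partitions):
    """Return the names of segments that can contain rows for filter_value.

    segment_partitions maps segment name -> the single partition id it holds.
    """
    target = partition_of(filter_value, num_partitions)
    return [name for name, part in segment_partitions.items() if part == target]
```

With four partitions and one segment per partition, an equality filter on the partition column only needs to touch one of the four segments. Sorting on the same column then narrows the scan further inside that segment.
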
**@wrbriggs:** Another stupid question - should I create an inverted index on
the sort column, or is that unnecessary?  
**@mayanks:** That is unnecessary; it won't be used. In fact, segment
generation might just ignore it and not create one.  
**@wrbriggs:** Perfect, thank you. It seemed like it would be unnecessary, but
I’ve seen stranger things, and the docs, while great for an incubating
project, were a little unclear - I would love to volunteer to keep notes while
I’m doing this, and maybe propose some updates to the docs if that would be
helpful.  
**@mayanks:** That would be really awesome, would really appreciate your help
in improving our docs.  
**@g.kishore:** @wrbriggs you might find this video useful  
**@g.kishore:** it talks about all the indexing techniques and when to use
what.  
**@wrbriggs:** That’s awesome, thank you  
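
As a rough illustration of why an inverted index adds nothing on the sort column: when a column is sorted, all rows for a given value sit in one contiguous block, so a binary search finds the matching document range directly. `doc_range_for_value` is a hypothetical helper written for this note, not a Pinot API.

```python
from bisect import bisect_left, bisect_right


def doc_range_for_value(sorted_values, value):
    """Return the contiguous range of document ids whose column equals value.

    sorted_values is the column's values in document order, which must
    already be sorted ascending (that is the sorted-column assumption).
    """
    start = bisect_left(sorted_values, value)
    end = bisect_right(sorted_values, value)
    return range(start, end)
```

An inverted index would store the same information as an explicit list of document ids per value, which is redundant when a start and end position already describe the match.
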

###  _#discuss-validation_

  
 **@chinmay.cerebro:** @mayanks @ssubrama @snlee: not sure if you got time to
review the table config validation schema created by @mohammedgalalen056:  
**@chinmay.cerebro:** Please review when you get some time  
 **@chinmay.cerebro:** I think there are cases where this might break things.
For example, the schema has `"replication": { "type": "string" }`, but I've
seen configs that use `"replication": 3`, which the validation flags because
it expects a string.  
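
One way to tolerate both forms is to accept either an integer or a numeric string and normalise it before validating. The sketch below is only an illustration of that idea; `normalize_replication` is a hypothetical helper and is not part of the proposed validation schema or of Pinot.

```python
def normalize_replication(value):
    """Accept replication as an int or a numeric string and return an int."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, str)):
        raise TypeError(
            f"replication must be an integer or numeric string, got {value!r}"
        )
    try:
        replicas = int(value)
    except ValueError:
        raise ValueError(f"replication is not a whole number: {value!r}") from None
    if replicas < 1:
        raise ValueError(f"replication must be at least 1, got {replicas}")
    return replicas
```

If the check stays in JSON Schema instead, the `type` keyword also accepts a list of types, so `"replication": { "type": ["string", "integer"] }` would admit both spellings.
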