Posted to dev@pinot.apache.org by Pinot Slack Email Digest <sn...@apache.org> on 2021/07/07 02:00:20 UTC

Apache Pinot Daily Email Digest (2021-07-06)

### _#general_

  
 **@benjamin.djidi:** @benjamin.djidi has joined the channel  
 **@trustokoroego:** @trustokoroego has joined the channel  
 **@trustokoroego:** Hi Everyone :wave:  
 **@alvaradojl1986:** @alvaradojl1986 has joined the channel  
 **@karinwolok1:** Hey all! Join us for this meetup today! Starting in 1.5
hours! Presentations by @elon.azoulay and @jackie.jxt  
**@karinwolok1:** In case you missed the meetup, you can watch it here!
Slides from @elon.azoulay 's presentation are also available in the
description!  
 **@ken:** We generate OFFLINE segments via Hadoop, and sometimes these are
updates to existing segments. In that case we want the segment names to match
exactly (so that it’s an update). For most segments this is fine, as we
partition by month. But there are cases where we also sub-partition by a non-
date field. In this situation I don’t see a way to leverage the
`SegmentNameGenerator` interface to give us a deterministic name. If we could
key off of the input (CSV) file name then it would be easy, as we’ve got full
control over that. Any ideas?  
**@mayanks:** For REFRESH tables (which don't have time column), the segment
naming scheme is something like <tableName>_idx. Does that not work?  
**@mayanks:** BTW, there's an issue opened recently about the exact same
requirement as yours  
**@mayanks:** Looking for contributions :wink:  
**@ken:** No, because our segment names will be something like `<table
name>_<country>_YYYY-MM` but for the US it’s `<table name>_us_YYYY-MM_idx`,
e.g. `ads_us_2020-08_0`  
**@ken:** For cases where we don’t have that final index (sub-partition) it’s
easy to ensure exact name matching. But with the US data, we need to sub-
partition by a field we use frequently in star tree indexes, so that we get
maximum gain.  
**@ken:** Thanks for the ref to the issue - yes, this is very similar to what
we need.  
**@ken:** Added some questions to the issue you referenced.  
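
The idea ken describes, keying the segment name off the input file name, might look roughly like the following minimal Python sketch. This is not Pinot's `SegmentNameGenerator` interface. The function name `segment_name_from_input_file` and the file-naming convention (`<table name>_<country>_YYYY-MM[_idx].csv`, with no underscores in the table name) are assumptions taken from the examples in the thread.

```python
import re
from pathlib import PurePosixPath

# Assumed input file naming (hypothetical): <table>_<country>_YYYY-MM[_<idx>].csv
_NAME_PATTERN = re.compile(
    r"^(?P<table>[A-Za-z0-9]+)_(?P<country>[a-z]{2})"
    r"_(?P<month>\d{4}-\d{2})(?:_(?P<idx>\d+))?$"
)


def segment_name_from_input_file(path: str) -> str:
    """Derive a deterministic segment name from the input file name.

    The same input file always yields the same name, so re-running the
    job for that file would target the same segment name.
    """
    stem = PurePosixPath(path).stem  # drop directories and the .csv suffix
    match = _NAME_PATTERN.match(stem)
    if match is None:
        raise ValueError(f"unexpected input file name: {path!r}")
    # Rebuild the name from the validated parts; idx is optional.
    parts = [match["table"], match["country"], match["month"]]
    if match["idx"] is not None:
        parts.append(match["idx"])
    return "_".join(parts)
```

Under these assumptions, `/data/in/ads_us_2020-08_0.csv` maps to `ads_us_2020-08_0`, and a file without a sub-partition such as `ads_fr_2020-08.csv` maps to `ads_fr_2020-08`.
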
 **@joshhighley:** if a table exists for multiple tenants, is it possible to
restrict query results to a single tenant?  
**@mayanks:** What do you mean by tenant here?  
**@joshhighley:** the Tenant component of Pinot  
**@joshhighley:** we need to segregate client data  
**@joshhighley:** well, looking at docs, can I specify multiple tenants when
creating a table? ```"tenants": { "broker": "myBrokerTenant", "server":
"myServerTenant" },```  
**@mayanks:** A table can only have one tenant for server and one for broker.
A tenant can be shared across tables  
**@joshhighley:** well, dang. So if we need to segregate data by client
(tenant) then each table requires a unique name?  
**@mayanks:** No it does not  
**@mayanks:** You can have a single table on a single tenant and have all
clients' data in the same table?  
**@joshhighley:** no -- our clients don't like their data mixed.  
**@mayanks:** Then have separate table per client?  
**@joshhighley:** each client needs their own 'customer' table, as an example  
**@mayanks:** Yeah, so 1 client - 1 table - 1 tenant if you want complete
separation  
**@mayanks:** Not a scalable model perhaps  
**@mayanks:** But seems like that is what your customers are asking for  
**@joshhighley:** no, without multi-tenancy, each client would have their own
environment. Each environment would have the same tables.  
**@mayanks:** What is an environment? A Helix cluster? If so, then two Helix
clusters are completely air-gapped and you are fine  
**@joshhighley:** our hope was to have TenantA on BrokerA and ServerA with
table 'Customers'. Then also, TenantB on BrokerB and ServerB with table
'Customers'...  
**@joshhighley:** I was using 'environment' in a general sense: a set of
servers.  
**@mayanks:** To me it sounds like separate tables? If so, why does the name
of the table need to be the same?  
**@mayanks:** Because customers may end up having their own schema as well in
future?  
**@mayanks:** Note that you cannot have multiple tables with same name in one
cluster  
**@joshhighley:** because we have 100s of clients. Managing tables
Customer_ClientA, Customer_ClientB, Customer_ClientC gets very cumbersome  
**@joshhighley:** there's lots of tables for each customer also  
**@mayanks:** I think you want same table across all customers but then no two
customers can share the same set of brokers/servers?  
**@joshhighley:** right. Their data needs to be kept separate  
**@mayanks:** That is also not scalable if you have 100's of customers. For
durability, you will end up having 3 brokers + 3 servers per customer,
regardless of what amount of data they have.  
**@mayanks:** One way is to partition the data on customerId. But that will
segregate at partition level and not customer level.  
**@mayanks:** Perhaps what customers really want is customer-level ACL?  
**@mayanks:** If so, that can be built on a mid-tier layer on top of single
table in Pinot?  
**@joshhighley:** our customers are financial companies -- mixing data across
those companies isn't an option  
**@mayanks:** What you are trying to use the tenant concept in Pinot for is not
what it is meant for, and it doesn't solve your problem.  
**@mayanks:** A table in Pinot can only have one tenant for server and one for
broker  
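
To make the "1 client - 1 table - 1 tenant" option concrete, here is a minimal Python sketch that generates one table config fragment per client. Only the shape of the `tenants` block comes from the thread. The helper name `table_config_fragment`, the `<table>_<client>` naming, and the tenant names are hypothetical, and the output is not a complete Pinot table config. As noted in the discussion, this layout may not scale well to hundreds of clients.

```python
import json


def table_config_fragment(base_table: str, client: str) -> dict:
    """Build an illustrative fragment: one table and one tenant pair per client."""
    return {
        # Table names must be unique within a cluster, hence the client suffix.
        "tableName": f"{base_table}_{client}",
        # A table has exactly one broker tenant and one server tenant.
        "tenants": {
            "broker": f"{client}BrokerTenant",  # hypothetical tenant name
            "server": f"{client}ServerTenant",  # hypothetical tenant name
        },
    }


if __name__ == "__main__":
    for client in ["ClientA", "ClientB"]:
        print(json.dumps(table_config_fragment("Customer", client), indent=2))
```

Generating the fragments from a client list would at least keep the per-client tables consistent, although it does not reduce the number of brokers and servers needed.
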
 **@pablomolnar:** @pablomolnar has joined the channel  
 **@karinwolok1:** Don't miss these 3 awesome meetups next week: Presenters:
@jackie.jxt @mayanks @kennybastani @tingchen and Gunnar Morling!  
 **@yhao:** @yhao has joined the channel  
 **@b.gilbert:** @b.gilbert has joined the channel  

###  _#random_

  
 **@benjamin.djidi:** @benjamin.djidi has joined the channel  
 **@trustokoroego:** @trustokoroego has joined the channel  
 **@alvaradojl1986:** @alvaradojl1986 has joined the channel  
 **@pablomolnar:** @pablomolnar has joined the channel  
 **@yhao:** @yhao has joined the channel  
 **@b.gilbert:** @b.gilbert has joined the channel  

###  _#feat-text-search_

  
 **@b.gilbert:** @b.gilbert has joined the channel  

###  _#troubleshooting_

  
 **@benjamin.djidi:** @benjamin.djidi has joined the channel  
 **@trustokoroego:** @trustokoroego has joined the channel  
 **@prashant.pandey:** Hi. We have a K8s Pinot deployment and some of our
queries are taking > 10s. We found one conspicuous correlation during our
investigation - Latency spikes happen when there is also a spike in YG GC
count. In the following charts, spikes happened across the board at 15:28.
Does this indicate a possible GC issue?  
**@mayanks:** Need more info. Is this server side? What’s the read QPS and
the data size on the server? What’s the heap size? What kind of queries?  
**@mayanks:** What version of Java?  
**@alvaradojl1986:** @alvaradojl1986 has joined the channel  
 **@pablomolnar:** @pablomolnar has joined the channel  
 **@yhao:** @yhao has joined the channel  
 **@b.gilbert:** @b.gilbert has joined the channel  

###  _#pinot-dev_

  
 **@atri.sharma:** @mayanks @g.kishore I am looking to support nulls in
aggregates (a common use case for us). Is there a place where I can get prior
thoughts and research, and potential starting ideas?  
**@mayanks:** @atri.sharma there has been some work done with null support in
the past, perhaps we can start from where that discussion ended cc @jackie.jxt
@chinmay.cerebro  
**@jackie.jxt:** @atri.sharma Does putting a null filter work for your use
case? E.g. `SELECT SUM(col) FROM table WHERE col IS NOT NULL`?  
**@jackie.jxt:** The main reason we didn't directly support nulls in
aggregates is the performance overhead of a per-value null check, and that it
would force us to use `Object[]` instead of a primitive array  
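
The trade-off above can be illustrated with a small Python analogy (Pinot itself is written in Java, so this is only a sketch). Here a list holding `None` plays the role of `Object[]`, and `array("d")` stands in for a primitive `double[]`. Both function names are hypothetical.

```python
from array import array
from typing import Optional, Sequence


def sum_with_null_checks(values: Sequence[Optional[float]]) -> float:
    """Aggregate over boxed values, paying a null check for every value."""
    total = 0.0
    for v in values:
        if v is not None:  # per-value null check inside the aggregation loop
            total += v
    return total


def sum_after_null_filter(values: Sequence[Optional[float]]) -> float:
    """Filter nulls first (like WHERE col IS NOT NULL), then aggregate
    over a primitive array with no null checks in the aggregation itself."""
    non_null = array("d", (v for v in values if v is not None))
    return float(sum(non_null))
```

Both functions return the same sum for the same input. The second keeps the aggregation loop free of null handling, which is the property the `IS NOT NULL` filter preserves.
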
 **@madhu.sling:** @madhu.sling has joined the channel  

###  _#community_

  
 **@vaibhav.mital:** @vaibhav.mital has joined the channel  
 **@b.gilbert:** @b.gilbert has joined the channel  

###  _#announcements_

  
 **@b.gilbert:** @b.gilbert has joined the channel  

###  _#multiple_streams_

  
 **@b.gilbert:** @b.gilbert has joined the channel  

###  _#presto-pinot-connector_

  
 **@ojasmulay:** @ojasmulay has joined the channel  

###  _#pinot-perf-tuning_

  
 **@b.gilbert:** @b.gilbert has joined the channel  

###  _#getting-started_

  
 **@madhu.sling:** @madhu.sling has joined the channel  