You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@pinot.apache.org by Pinot Slack Email Digest <ap...@gmail.com> on 2021/11/27 02:00:15 UTC

Apache Pinot Daily Email Digest (2021-11-26)

### _#general_

  
 **@pachirao7:** @pachirao7 has joined the channel  
 **@ashish:** Is it possible to have table sorted on a multi valued column?  
**@steotia:** No, we don't support sorted index on multi value column  
**@ashish:** I see. Is it possible to add this support?  
**@steotia:** I think settling on semantics of what sorted means for multi
valued column would be crucial and how the sort order will be leveraged during
filter processing. I am guessing for your string column, you sorted each
column array value and there is no order across column values. I suggest
creating an issue to start a discussion thread.  
**@ashish:** Will create an issue. Yes, I just sorted by column array value.  
**@steotia:** I can't see how sorting a multi valued column will exactly help
with performance. Dictionary and inverted index on a multi valued column is
created in same manner as for a single valued column. This implies that
predicate evaluation has no difference. It's the forward index that has
different format for multi valued column. If let's say we sort it, then
predicate evaluation across rows will still require inverted index. Anyway
lets create an issue and discuss there. I don't remember if this came up
before  
**@ashish:** Ok, maybe, I am stating my problem a bit differently. Here is the
original issue that I was trying to figure out - for a multi valued column,
the forward index is taking up a quite bit of space. I created a MV column
that has cardinality of 10, dictionary takes 4 bits per element, and each row,
say, contains exactly 5 elements in the array. For a segment with 100000 rows
(docs), it ends up taking (4*5+2)*100000/8 bytes. It seems Pinot stores array
of dictId + length in the forward index. I was expecting it to dictionary
encode the array of dictId, as cardinality of MV column is not 100000.  
**@ashish:** I am ok with creating a inverse index for query speed up but I am
mainly concerned with the space issue.  
**@ashish:** In the above example, each entry in the fwd index is taking up 3
bytes - while only 17 bits are need to represent 100,000 unique values and the
cardinality of the MV column is much lesser than the number of docs in the
segment.  
**@mayanks:** If you dict encode on array, then you size of dictionary will
become large. You should look at overall size of dict + fwd index for a column  
**@mayanks:** Could you please share what that overall size would be in your
case?  
**@mayanks:** I don’t think sorted index is the answer here. But would like to
understand your storage concern a bit more. What’s the overall data size and
is this really exploding your data size?  
**@ashish:** Here is what I did: I created two MV columns (string, dimension)
tags__KEYS, tags__VALUES and created another column seriesID is single value
with content equal to k1=v1,k2=v2, … So for a row if tags__KEYS = k1,k2 and
tags__VALUES=v1,v2, then seriesID=“k1=v1,k2=v2”.  
**@ashish:** The cardinality of tags__KEYS=10, cardinality of tags__VALUES=65,
cardinality of seriesID=10000. Number of docs=720000.  
**@ashish:** With this data, here are sizes for the dictionary and fwd
indices: index_map:tags__VALUES.forward_index.size = 7214060
index_map:tags__KEYS.forward_index.size = 4514060
index_map:seriesID.forward_index.size = 1260008
index_map:tags__KEYS.dictionary.size = 78
index_map:tags__VALUES.dictionary.size = 1048
index_map:seriesID.dictionary.size = 50008  
**@ashish:** So tags_KEYS and tags_VALUES end up taking 16 bytes per row,
while the seriesID takes up total of 1.81 bytes per row.  
**@ashish:** That’s a factor 9  
**@ashish:** This is blowing up the space requirements  
**@ashish:** Here the cardinality of seriesID should be same as array of
dictIds in tags_KEYS+tags+VALUES  
**@ashish:** If seriesID is treated as a dictionary id for
tags_KEYS+tags_VALUES then we can also have a sorted range index on it.  
**@ashish:** When I upload a segment sorted by seriesId, even the 1.81 bytes
per row usage for seriesId disappeared too!  
**@ashish:** That is what I want to achieve.  
**@ashish:** just to be clear, I am not storing the full k1=v1, k2=v2 in the
seriesID - just the serial number.  
**@mayanks:** Hmm, may be we should understand if the data modeling is
optimal?  
**@mayanks:** To close the thread: We sync'd up offline, and based on the
requirements, a better data model is to use JSON index, that also seems to
mitigate the storage issue.  
 **@ashish:** I tried to upload a segment with data sorted by a multi valued
string column but pinot server shows isSorted = false in the segment metadata.  
 **@sanketjoshi4:** @sanketjoshi4 has joined the channel  

###  _#random_

  
 **@pachirao7:** @pachirao7 has joined the channel  
 **@sanketjoshi4:** @sanketjoshi4 has joined the channel  

###  _#troubleshooting_

  
 **@pachirao7:** @pachirao7 has joined the channel  
 **@pachirao7:** Hi Team, we are using Apache Pinot for running OLAP queries
and right now one of the table is stuck in consumer rebalance state(while
pointing to a kafka topic) for last 12 hours. I tried restarting the Pinot
Servers which didn't help . Can someone please suggest any steps here . These
messages came in repeatedly ```[Consumer clientId=consumer-2,
groupId=event_template_mapping_REALTIME_1627646764788_0] Group coordinator <*>
(id: 795314267 rack: null) is unavailable or invalid, will attempt rediscovery
[Consumer clientId=consumer-2,
groupId=event_template_mapping_REALTIME_1627646764788_0] Discovered group
coordinator <*> (id: 795314267 rack: null) [Consumer clientId=consumer-2,
groupId=event_template_mapping_REALTIME_1627646764788_0] (Re-)joining group```
And finally after 6-7 hours saw this message , which basically didn't fetch
any partition . ```[Consumer clientId=consumer-4,
groupId=event_template_mapping_REALTIME_1627646764788_0] Successfully joined
group with generation 6 [Consumer clientId=consumer-4,
groupId=event_template_mapping_REALTIME_1627646764788_0] Setting newly
assigned partitions []```  
 **@sanketjoshi4:** @sanketjoshi4 has joined the channel  

###  _#pinot-dev_

  
 **@linjiabao001:** @linjiabao001 has joined the channel  
\--------------------------------------------------------------------- To
unsubscribe, e-mail: dev-unsubscribe@pinot.apache.org For additional commands,
e-mail: dev-help@pinot.apache.org