Posted to dev@pinot.apache.org by Pinot Slack Email Digest <ap...@gmail.com> on 2021/12/13 02:00:18 UTC

Apache Pinot Daily Email Digest (2021-12-12)

### _#general_

  
 **@karinwolok1:** Reminder - tomorrow, Monday, December 13 - Apache Pinot
2021 recap and future roadmap discussion! :call_me_hand:  :pencil2: Vote on
features and improvements you'd like to see in Apache Pinot 2022! :pencil2:  
**@diogo.baeder:** One more question, folks: when it comes to segments of
~200M in size, what segment storage technology would you recommend using when
running a cluster in AWS? HDFS? S3? EFS mounted?  
**@g.kishore:** EBS  
**@mayanks:** Yes, for local storage attached to serving nodes you can use
EBS. For deep store you can use S3.  
**@ken:** @diogo.baeder - you can also use HDFS for deep store.  
**@ken:** @g.kishore do you know of any Pinot performance comparisons of EBS
vs local SSDs?  
**@g.kishore:** Nothing in a presentable form.  
**@diogo.baeder:** Thanks guys, but which one of those options do you think
gives us the best performance, say, in a scenario with something like up to
10 TB of data?  
**@g.kishore:** What are your QPS and latency expectations?  
**@g.kishore:** The only options are local SSD, EBS, or EFS  
**@g.kishore:** S3 and HDFS are only applicable to the deep store, which is a
backup segment store and is not accessed at query time  
**@diogo.baeder:** QPS up to a few dozen at most; latency can be in the
seconds, but preferably under 1 minute. Thanks for the info, man!  
**@mayanks:** Yeah, you definitely don’t need local SSD for this. As Kishore
mentioned, any of the network-attached disk options on the serving nodes will
work.  
**@diogo.baeder:** Ah, awesome, thank you guys!  
**@mayanks:** Since the latency is not too tight, you might want to pack a lot
of data per instance, so EBS for serving nodes seems good. For deep store, S3
or HDFS both work (S3 is more popular in my personal experience).  
**@diogo.baeder:** Got it. I'll take that into consideration, and also
probably go for S3 for the deep store backups (since we already use it a lot
for other things)  
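For reference, a minimal sketch of the kind of controller configuration the S3
deep store option involves, based on Pinot's S3 filesystem plugin; the bucket,
path, and region below are placeholders, not values from this thread:

    # Deep store on S3 (bucket, path, and region are placeholders)
    controller.data.dir=s3://my-pinot-bucket/pinot-segments
    controller.local.temp.dir=/tmp/pinot-tmp-data
    # Register the S3 filesystem plugin and segment fetcher
    pinot.controller.storage.factory.class.s3=org.apache.pinot.plugin.filesystem.S3PinotFS
    pinot.controller.storage.factory.s3.region=us-east-1
    pinot.controller.segment.fetcher.protocols=file,http,s3
    pinot.controller.segment.fetcher.s3.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher

Servers typically need the matching pinot.server.* storage-factory and
segment-fetcher settings so they can download segments from the deep store.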
**@ashwinviswanath:** If you want latency in seconds ideally, have you
considered Hudi?  
**@diogo.baeder:** Not really; I'm not sure what role that would play when
integrated with Pinot, though  
 **@ashish:** Is there any way to extract more than one field from a JSON
column? jsonextractscalar only allows one field at a time. So, if I do select
jsonextractscalar(jsonColumn, 'field1'), jsonextractscalar(jsonColumn,
'field2'), will it result in parsing the JSON document twice for each doc/row?  
**@g.kishore:** Parsing will probably happen twice but reading from disk will
happen only once  
**@ashish:** Tried various things and figured one could do this: select
jsonextractscalar(jsoncolumn, '$["f1", "f2"]')  
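To make the two approaches above concrete, here is a minimal Pinot SQL sketch;
the table name events and the field names are hypothetical, and
JSONEXTRACTSCALAR also takes a result-type argument that the messages above
omit:

    -- Approach 1: one JSONEXTRACTSCALAR call per field; the column is read
    -- from disk once, but each call may parse the JSON document separately.
    SELECT
      JSONEXTRACTSCALAR(jsonColumn, '$.field1', 'STRING') AS field1,
      JSONEXTRACTSCALAR(jsonColumn, '$.field2', 'STRING') AS field2
    FROM events
    LIMIT 10

    -- Approach 2: a single call with a multi-field JSON path, as found above;
    -- the selected fields presumably come back together as one STRING value.
    SELECT
      JSONEXTRACTSCALAR(jsonColumn, '$["f1", "f2"]', 'STRING') AS f1_and_f2
    FROM events
    LIMIT 10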
 **@ashish:** There does not seem to be a way to exclude properties in the
JSON path expression used by jsonextractscalar. I guess the only way is to
write my own jsonextractscalar variant that calls something like
jsonParser.delete(propertiesToDelete).read(propertiesToFetch). Is my
understanding right? Any other suggestions?  
**@g.kishore:** Do you have an example of what you are trying to accomplish  
**@ashish:** Basically, the JSON column is a flat map of string -> string, and
I am trying to do a group by in two different ways: 1. group by key1, key2;
2. group by all other keys after excluding key1 and key2, where key1 and key2
are field key names in the flat-map-like JSON column. The key names depend on
the filter being used, so I cannot convert key1 and key2 to static columns.  
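The first grouping scenario is straightforward to sketch in Pinot SQL (the
table name events and the key names are hypothetical); the second scenario,
grouping by all keys except key1 and key2, has no direct jsonextractscalar
equivalent, which is what the question above is getting at:

    -- Scenario 1: group by two known keys extracted from the JSON map column
    SELECT
      JSONEXTRACTSCALAR(jsonColumn, '$.key1', 'STRING') AS key1,
      JSONEXTRACTSCALAR(jsonColumn, '$.key2', 'STRING') AS key2,
      COUNT(*) AS cnt
    FROM events
    GROUP BY JSONEXTRACTSCALAR(jsonColumn, '$.key1', 'STRING'),
             JSONEXTRACTSCALAR(jsonColumn, '$.key2', 'STRING')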