Posted to dev@pinot.apache.org by Pinot Slack Email Digest <ap...@gmail.com> on 2022/05/06 02:00:29 UTC

Apache Pinot Daily Email Digest (2022-05-05)

### _#general_

  
 **@kennyleejh:** @kennyleejh has joined the channel  
**@ryan.ariasruane:** *Pinot Client Rust* Hi there. I wrote in the other day
about multi-value column ingestion jobs, and at the request of @mayanks, I
created the issue: . The reason I was trying to create a table ingesting all
possible types is that I am writing a Rust client modelled after . Here is the
repo, if anyone is interested:  
**@ryan.ariasruane:** Given , I will have to remove the PQL support, however.  
**@kharekartik:** Hi Ryan, I am looking into this issue. Can you also send me
the link to the discussion with Mayank, or just the high-level points?  
**@ryan.ariasruane:** Everything and more about that issue is on the issue
itself, including schema, ingestion job, and data  
**@ryan.ariasruane:** I mainly meant to let the community know about the Rust
client here  
**@kharekartik:** Cool, I'll update you in some time. The Rust client will be
amazing! Thanks a lot for the contribution!  
**@ryan.ariasruane:** The client provides a DataRow type for SQL responses,
which is strongly typed. It does this by first deserializing the result table
to JSON and then rationalizing the types based on the schema. Alternatively, a
row can be deserialized to any struct implementing `serde::Deserialize`, which
it does by converting each row to a JSON map of column name to value and then
passing that to the final deserializer. Ultimately there's a question of
whether result table deserialization might be better supported (faster/lower
memory) by implementing a custom serde deserializer, but for now I just
inserted myself as a middleman and manipulated the row structure into one
already supported for arbitrary deserialization.  
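A minimal sketch of the row-to-struct path described above; the helper name,
signature, and the `Count` struct are illustrative assumptions, not the actual
pinot-client-rust API:

```
use serde::Deserialize;
use serde_json::{json, Map, Value};

// Hypothetical helper (not the real pinot-client-rust API): turn one
// positional result-table row into a JSON object keyed by column name,
// then let serde produce the caller's strongly typed struct.
fn row_to_struct<T: serde::de::DeserializeOwned>(
    column_names: &[String],
    row: &[Value],
) -> Result<T, serde_json::Error> {
    // Build {"columnName": value, ...} from the positional row.
    let object: Map<String, Value> = column_names
        .iter()
        .cloned()
        .zip(row.iter().cloned())
        .collect();
    // Hand the JSON object to the final deserializer.
    serde_json::from_value(Value::Object(object))
}

#[derive(Deserialize, Debug)]
struct Count {
    cnt: i64,
}

fn main() -> Result<(), serde_json::Error> {
    let columns = vec!["cnt".to_string()];
    let row = vec![json!(97889)];
    let typed: Count = row_to_struct(&columns, &row)?;
    println!("{typed:?}"); // Count { cnt: 97889 }
    Ok(())
}
```

As noted in the message, a custom `serde::Deserializer` over the columnar
result table could avoid the intermediate `Value` allocation, at the cost of
more implementation work.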
**@kharekartik:** We are also planning to introduce binary responses so that
serialization/deserialization time can be reduced. Sort of like a proto/thrift
response from the broker instead of JSON.  
**@ryan.ariasruane:** Interesting. Good to know.  
**@ryan.ariasruane:** I would welcome any feedback and contributions. Apart
from removing PQL, and some mild uncertainty about edge cases for the array
columns of the referenced issue, the repo is usable. Examples and
documentation are provided, and everything is tested.  
 **@shivams:** @shivams has joined the channel  
 **@vienna:** @vienna has joined the channel  
**@tonya:** Hi, folks! :wave: StarTree and Cisco Webex are co-hosting a
virtual meetup on May 12 at 7pm CDT called  :pinot:  :computer:  
**@ashwinviswanath:** If this is going to be held over a YouTube link (similar
to today's YouTube Live interview of Neha by Pete Soderling of Data Council),
or even over Webex, can we have the link instead of having to register with
Meetup?  

### _#random_

  
 **@kennyleejh:** @kennyleejh has joined the channel  
 **@shivams:** @shivams has joined the channel  
 **@vienna:** @vienna has joined the channel  

###  _#troubleshooting_

  
 **@kennyleejh:** @kennyleejh has joined the channel  
**@nikhil.varma:** Hi everyone, I'm trying to store segments in MinIO as S3
deep storage, but it is writing temporary segments instead of saving the final
segments. Please help if anyone has faced this issue before. Thank you  
**@mark.needham:** can you share your config?  
**@nikhil.varma:** Here I'm using MinIO as S3  
**@nikhil.varma:** @mathew.pallan  
**@mark.needham:** your config looks good to me. Let me try to reproduce it  
**@mark.needham:** I set up everything with your config and MinIO running
locally, and it seems to work for me. I wrote it up here -  - in case you can
spot what I have different  
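For reference, a controller config for S3-compatible deep storage typically
looks roughly like the following; the bucket, endpoint, and credential values
are placeholders, and the endpoint override is what points Pinot at a local
MinIO server instead of AWS:

```
controller.data.dir=s3://pinot-segments/pinot
controller.local.temp.dir=/tmp/pinot-tmp-data
pinot.controller.storage.factory.class.s3=org.apache.pinot.plugin.filesystem.S3PinotFS
pinot.controller.storage.factory.s3.region=us-east-1
# Point the S3 filesystem at the local MinIO server instead of AWS.
pinot.controller.storage.factory.s3.endpoint=http://localhost:9000
pinot.controller.storage.factory.s3.accessKey=minioadmin
pinot.controller.storage.factory.s3.secretKey=minioadmin
pinot.controller.segment.fetcher.protocols=file,http,s3
pinot.controller.segment.fetcher.s3.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
```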
**@nikhil.varma:** Thanks for the response, I'll go through it  
**@mayanks:** Thanks @mark.needham  
**@shivams:** @shivams has joined the channel  
 **@vienna:** @vienna has joined the channel  
 **@diogo.baeder:** Hi folks, quick question: table indexes - whatever the
type - are created inside each segment, and don't ever cross segments, right?
Asking just to confirm.  
**@richard892:** yes that's right  
**@diogo.baeder:** Cool, thanks! This is just so that I'm more aware of the
sizes of the indexes I'll end up having, to size them more adequately.  

###  _#community_

  
 **@tonya:** @tonya has joined the channel  

###  _#announcements_

  
 **@tonya:** @tonya has joined the channel  

###  _#getting-started_

  
 **@kennyleejh:** @kennyleejh has joined the channel  
**@harish.bohara:** One question - select count(*) from table where
pipeline='TRANSACTIONAL' and eventTime > '2022-05-04 22:12:53.791' -> this
gives a count of 10-20K. But if I use the following to check the plan: explain
plan for select count(*) from table where pipeline='TRANSACTIONAL' and
eventTime > '2022-05-04 22:12:53.791'  
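For readability, the two statements being compared (reconstructed from the
message above):

```
SELECT COUNT(*) FROM table
WHERE pipeline = 'TRANSACTIONAL' AND eventTime > '2022-05-04 22:12:53.791';

EXPLAIN PLAN FOR
SELECT COUNT(*) FROM table
WHERE pipeline = 'TRANSACTIONAL' AND eventTime > '2022-05-04 22:12:53.791';
```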
**@mark.needham:** @amrish.k.lal any ideas?  
**@harish.bohara:** Version 0.10.0  
**@mayanks:** I couldn’t understand the question from your message
@harish.bohara  
**@mark.needham:** I think it's the next message with the screenshot. Like why
does it say 'no matching segment'  
**@mark.needham:** when if you run the query you get a result  
**@mayanks:** Oh, thread view fooled me  
**@mark.needham:** yeh sorry - I replied on a thread to the wrong message  
**@mark.needham:** only realised after I pressed enter  
**@harish.bohara:** Yes, when I run the query I get results. However, if I try
to check the plan it shows "no matching segments"  
**@mayanks:** I think it means number (not No)  
**@mayanks:** That should be fixed  
**@mayanks:** Looking at the code, seems it is really NO matching segments and
not number. @amrish.k.lal this seems incorrect, could you please take a look?  
**@steotia:** Can I see the output of EXPLAIN PLAN?  
**@mayanks:**  
**@amrish.k.lal:** I am wondering if this is caused by the fact that we pick a
random segment on a random server to generate EXPLAIN PLAN output. Do they
consistently get the same output if they run the EXPLAIN PLAN query multiple
times?  
**@amrish.k.lal:** Richard recently opened an  on this which has some more
details.  
**@amrish.k.lal:** @mayanks Yes, `NO_MATCHING_SEGMENT` is really no matching
segments on the particular server from which we got EXPLAIN PLAN results.  
**@mayanks:** @amrish.k.lal Yep I saw that, I was asking why that is returned
in this case, as it seems incorrect.  
**@amrish.k.lal:** Without knowing more, I can only assume that they have
multiple servers with multiple segments on each server, and the server that we
randomly picked to run EXPLAIN PLAN on did not have any matching segments.
That's why I am wondering if they see some variation in EXPLAIN PLAN output if
the query is run multiple times.  
**@richard892:** the problem with this approach is it bypasses pruning  
**@richard892:** so it will claim that something Pinot would never do will
happen, and if enough segments are pruned before the query evaluates, you have
to get _really_ lucky to pick a segment that could be used to generate a
realistic plan  
 **@harish.bohara:** I get this..  
**@harish.bohara:** Why does it not show the plan details?  
 **@shivams:** @shivams has joined the channel  
**@mayanks:** Limit does not apply to aggregations without group by. Also, 0
and 1 are operator ids. So the question is still unclear.  
**@mayanks:** But yes @amrish.k.lal, what's the purpose of showing limit 10?  
 **@amrish.k.lal:** @amrish.k.lal has joined the channel  
 **@vienna:** @vienna has joined the channel  
**@grace.lu:** Hi team, I would like to understand whether Pinot has some
query caching/warm-up mechanism behind the scenes. Asking because I noticed
that the first run of a query is always the slowest: for example, when I run a
count group-by query against a table for the first time it takes 3000ms, but
if I run it again in the next couple of minutes, the same query consistently
takes less than 100ms.  
**@mayanks:** This is typically due to Pinot MMAP'ing segments. Could you
share:
```
- Do you have local disk vs remote/EBS?
- Are those SSD vs HDD?
- What's the total RAM, and what's your xms/xmx?
- Is the query triggering a ton of random reads? This can sometimes be
  avoided by picking the right sorted index.
```
**@grace.lu:** Hi @mayanks, we are using an attached EBS volume with GP2 (SSD)
storage; memory: 80G, -Xms40G -Xmx50G. This observation is not a problem for
us for now, but we are curious to know what's happening behind the scenes.  
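As an illustration of the sorted-index suggestion above, a sorted column is
declared in the Pinot table config roughly like this (`eventTime` is a
placeholder column; for offline tables the input data must already be sorted
on that column, while real-time tables sort it when committing segments):

```
{
  "tableIndexConfig": {
    "sortedColumn": ["eventTime"]
  }
}
```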
 **@steotia:** @steotia has joined the channel  

###  _#introductions_

  
 **@kennyleejh:** @kennyleejh has joined the channel  
 **@shivams:** @shivams has joined the channel  
 **@navina:** @navina has joined the channel  
 **@vienna:** @vienna has joined the channel  