
Apache Pinot Daily Email Digest (2022-04-20)

### _#general_

  
 **@abhijeet.singh:** @abhijeet.singh has joined the channel  
 **@m.akash2209:** @m.akash2209 has joined the channel  
 **@yahya.zuberi:** @yahya.zuberi has joined the channel  
**@harish.bohara:** I have a basic question on Pinot search (I'm sure Pinot handles this; I just want to know the internals of how it does it): suppose I have thousands of segment files accumulated over a long time. How does a query avoid looking at all these files to produce a result, e.g. if I query for data in a time range? Does Pinot know which segment files to read to fulfil this query? Any detailed doc that explains this would help.  
**@kharekartik:** @richard892  
**@francois:** I will give you what I've understood :wink: A segment can be seen as a partition, and each segment has its own indexes to provide fast and reliable responses. In addition, segment pruning can pre-filter which segments to query based on the query's filter predicates, your time range for example. A Pinot expert will give you more detail from the inside, of course :slightly_smiling_face: Still a newbie  
**@richard892:** there's lots to say about this topic but the keyword to
search for is "segment pruning" - best to start here  
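To make the pruning idea concrete, here is a minimal sketch, assuming a hypothetical table `metrics` with an epoch-millis time column `ts`: every segment carries its min and max time in its metadata, so segments whose time range does not overlap the filter are skipped without being read.
```
-- Only segments whose [start time, end time] metadata overlaps this
-- range are actually scanned; the rest are pruned up front.
SELECT deviceId, COUNT(*)
FROM metrics
WHERE ts BETWEEN 1650412800000 AND 1650499200000
GROUP BY deviceId
```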
**@parthiv.shah.777:** @parthiv.shah.777 has joined the channel  
 **@yahya.zuberi:** Hey everyone, we are exploring using GraphQL on Pinot. Please share any experiences, recommendations, or challenges that could help our POC. Regards  
**@g.kishore:** Are you planning to use a tool/library, or writing a custom handler to convert GraphQL to a Pinot SQL query?  
**@yahya.zuberi:** We were planning to do it manually. Are there any tools/libraries available, @g.kishore?  
**@g.kishore:** Not that we are aware of.. if you plan to do it in open
source, we will be happy to guide/review.  
 **@saumya2700:** hi All, I have a question about the best way to create segments. I have a realtime table ingesting data from a Kafka topic, and the topic uses deviceId as the key. Should we create segments per deviceId, or should they be the default time-based segments? Our queries mostly filter on a deviceId and a time range. If we create segments per deviceId, will the same deviceId's data go into the same segment, and if so, will queries be faster because Pinot only looks at the segments containing those deviceIds? How does it work?  
**@g.kishore:** Pinot supports both time and space partitioning  
**@g.kishore:** In most cases, it’s better to partition by time first (day)
and then space(deviceid)  
**@saumya2700:** Any sample or document where both are used? I am struggling with how to do both; what config is required for that?  
**@g.kishore:**  
**@g.kishore:** Time partitioning happens automatically: segment metadata has a start time and end time, which get used during pruning  
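For the space half, here is a minimal table-config sketch; hedged: the column name `deviceId` and 8 partitions are assumptions, and the Kafka topic must be keyed on the same column so each stream partition only carries a subset of deviceIds.
```
"tableIndexConfig": {
  "segmentPartitionConfig": {
    "columnPartitionMap": {
      "deviceId": { "functionName": "Murmur", "numPartitions": 8 }
    }
  }
},
"routing": {
  "segmentPrunerTypes": ["partition"]
}
```
With this in place, a query filtering on a specific deviceId is routed only to segments of the matching partition, and the time-range predicate then prunes further by segment start/end times.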
 **@iladro:** @iladro has joined the channel  
 **@tim58jsy:** @tim58jsy has joined the channel  

###  _#random_

  
 **@abhijeet.singh:** @abhijeet.singh has joined the channel  
 **@m.akash2209:** @m.akash2209 has joined the channel  
 **@yahya.zuberi:** @yahya.zuberi has joined the channel  
 **@parthiv.shah.777:** @parthiv.shah.777 has joined the channel  
 **@iladro:** @iladro has joined the channel  
 **@tim58jsy:** @tim58jsy has joined the channel  

###  _#troubleshooting_

  
 **@abhijeet.singh:** @abhijeet.singh has joined the channel  
 **@m.akash2209:** @m.akash2209 has joined the channel  
 **@yahya.zuberi:** @yahya.zuberi has joined the channel  
 **@parthiv.shah.777:** @parthiv.shah.777 has joined the channel  
 **@yeongjukang:** Hello folks, I have some questions about the memory consumption of server instances.
• Roughly how much RAM will a segment take if I ingest a 687MB CSV, which generates a 109MB gzipped segment tar file?
• I have 5 server nodes with 4272/3884/4438/3493/3661 segments each, and they take 27/8/6/3/18 Gi of RAM respectively. What makes them different from each other? Some kind of raw data cache?
Thanks in advance.  
**@richard892:** hi, the first question is impossible to answer, it depends on
the data type, the cardinality of the columns, the compression configuration,
what indexes you have and so on  
**@richard892:** as for the second question, this is because of partitioning - what column are you partitioning on, and is there skew in the distribution of the number of records or of the size of the records?  
**@yeongjukang:** @richard892 Thanks for the reply. I didn't specify a partitioning option for the tables. As for distribution, the records seem to be distributed fairly evenly.  
**@yeongjukang:** Regarding the size consumption, could you share a case you know of or a public example?  
**@yeongjukang:** Additionally, I am running the servers on a k8s cluster. After I deleted the pod that was using 27Gi of RAM, it came back using only 13Gi. Could this be another clue?  
 **@diogo.baeder:** Hey guys, sorry to ask something about Trino here, but it's related to Pinot, and unfortunately I couldn't get help from anyone in their community. I'm doing an experiment with Trino and Pinot, and I noticed that a query I run against different tables in Pinot ends up with Trino actually querying all of the existing segments in Pinot, completely bypassing any partitioning I defined for my tables. Is this expected? Has anybody here experienced this as well and solved it? I'm thinking about using dynamic tables in Trino to work around the issue, but it just feels dirty to have to do that...  
**@g.kishore:** what's the query? @elon.azoulay can provide more info.  
**@diogo.baeder:** Sorry for the delay; here's an example query I run, which queries inside a specific period of time, but Trino bypasses that and instead queries all segments regardless of the dates:
```
select distinct weight
from weights
inner join brands_metrics on weights.pmxid = brands_metrics.pmxid
inner join filters on weights.pmxid = filters.pmxid
where brand_id = 1000226
  and filters.name = 'gender'
  and contains(filters."values", 2)
  and weights.date > 20130501 and weights.date < 20130515
  and brands_metrics.date > 20130501 and brands_metrics.date < 20130515
  and filters.date > 20130501 and filters.date < 20130515
```
Here, `brands_metrics`, `weights` and `filters` are all tables that contain a `date` column set as the time column upon table creation.  
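Since dynamic tables came up as the workaround: with them, the quoted inner query is executed by Pinot itself, so the date predicates reach Pinot's segment pruner instead of being evaluated in Trino. A hedged sketch (assuming the catalog is named `pinot`; the column choices are illustrative):
```
-- The double-quoted "table name" is pushed down and run by Pinot,
-- so only segments overlapping the date range are scanned.
SELECT * FROM pinot.default."SELECT pmxid, weight FROM weights WHERE date > 20130501 AND date < 20130515"
```
Note that a dynamic table targets a single Pinot table, so the joins across `weights`, `brands_metrics` and `filters` would still run in Trino, but over pre-filtered rows.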
 **@iladro:** @iladro has joined the channel  
 **@tim58jsy:** @tim58jsy has joined the channel  
 **@abhijeet.kushe:** I am using Pinot 0.9.1 and wanted to know how order by works. The query below returns a response:
```
select taskName, taskResult, distinctcount(personId)
from events
where accountId = 1100609261882 AND workId = '40d9652e-c543-4bd2-aa4d-a11c7b23a6df'
group by taskName, taskResult
order by mode(createdOn) asc
limit 10000
```
but this throws an exception:
```
select taskName, taskResult, distinctcount(personId)
from events
where accountId = 1100609261882 AND workId = '40d9652e-c543-4bd2-aa4d-a11c7b23a6df'
group by taskName, taskResult
order by createdOn asc
limit 10000
```
```
[ { "message": "QueryExecutionError:\nProcessingException(errorCode:450, message:InternalError:\njava.lang.NullPointerException\n\tat
org.apache.pinot.core.operator.combine.GroupByOrderByCombineOperator.mergeResults(GroupByOrderByCombineOperator.java:230)\n\tat
org.apache.pinot.core.operator.combine.BaseCombineOperator.getNextBlock(BaseCombineOperator.java:120)\n\tat
org.apache.pinot.core.operator.combine.BaseCombineOperator.getNextBlock(BaseCombineOperator.java:50)",
"errorCode": 200 } ]
```
`mode(createdOn) asc` is the difference that makes it work... is this a bug?  
**@npawar:** you'd have to explicitly add createdOn to the group by clause. Otherwise, if you group as (taskName, taskResult) -> (distinctCount(personId)), which createdOn should be considered? There can be several in each group, right?  
**@abhijeet.kushe:** The reason I need the order by in the query is so that the (taskName, taskResult, distinctcount) rows are sorted by createdOn - in other words, the first taskName and taskResult are displayed first in the output. I want a consistent, guaranteed sort order, either ascending or descending, and I don't want to group by each createdOn timestamp.  
**@npawar:** in that case, can you put max(createdOn) in the selection, and
order by that?  
**@npawar:**
```
select taskName, taskResult, distinctcount(personId), max(createdOn)
from events
where accountId = 1100609261882 AND workId = '40d9652e-c543-4bd2-aa4d-a11c7b23a6df'
group by taskName, taskResult
order by max(createdOn) desc
limit 10000
```
**@abhijeet.kushe:** @npawar thanks for your quick response. The above query works, and so does the one below. Is max(createdOn) needed in the select clause?
```
select taskName, taskResult, distinctcount(personId)
from events
where accountId = 1100609261882 AND workId = '40d9652e-c543-4bd2-aa4d-a11c7b23a6df'
group by taskName, taskResult
order by max(createdOn) desc
limit 10000
```
**@npawar:** ah, yes, you don't need it in the projection  
**@abhijeet.kushe:** Awesome thanks  

###  _#getting-started_

  
 **@abhijeet.singh:** @abhijeet.singh has joined the channel  
 **@m.akash2209:** @m.akash2209 has joined the channel  
 **@yahya.zuberi:** @yahya.zuberi has joined the channel  
 **@yahya.zuberi:** Hello, we need to connect Pinot with our Java/Kotlin Spring Boot application. 1. Do we have pooling options while using JDBC?  
**@kharekartik:** Hi, no, we don't support pooling in our JDBC client currently. You can create a ticket for it. Will be happy to take it up  
**@kharekartik:** You can also use our Java client (non-JDBC), which provides async query methods.  
**@yahya.zuberi:** Hi Kartik, can we manually create a pool? @kharekartik  
 **@ysuo:** Hi, I have a tricky question. "NAN" values are stored in my realtime table. My table config is like below, and the value in the Kafka message looks like {"key":{"price_value":"NAN"}}. Is there some way I can filter out these NAN messages in the ingestion phase or in the query phase?
```
"metricFieldSpecs": [ { "name": "price_value", "dataType": "DOUBLE" } ]
```
**@kharekartik:** Hi, you can use `filterConfigs` in the table config for this  
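For the ingestion side, a hedged sketch of what that could look like in the table config; note that recent docs spell the key `filterConfig` under `ingestionConfig`, and the Groovy predicate and field name here are assumptions based on the message shape above. Records for which the function evaluates to true are skipped at ingestion time.
```
"ingestionConfig": {
  "filterConfig": {
    "filterFunction": "Groovy({price_value == \"NAN\"}, price_value)"
  }
}
```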
**@parthiv.shah.777:** @parthiv.shah.777 has joined the channel  
 **@iladro:** @iladro has joined the channel  
 **@tim58jsy:** @tim58jsy has joined the channel  