Posted to dev@pinot.apache.org by Pinot Slack Email Digest <sn...@apache.org> on 2021/01/16 02:00:14 UTC

Apache Pinot Daily Email Digest (2021-01-15)

### _#general_

  
 **@zxcware:** Hi team, when should I set `exclude.sequence.id` ? Is it used
just for naming the segment? If I create 3 segments, each with a unique name
but having the same time-range, can I set `exclude.sequence.id` true all the
time?  
**@fx19880617:** it is typically used when your input directory is a root
directory that contains multiple days of data and you want each day to keep
the same segment name when you re-run the job  
**@zxcware:** I see. Does the name mean anything else to the query engine?
Does the engine look at the name of the segment for filtering?  
**@fx19880617:** e.g. we bootstrap a root directory `/my/data/` that
contains `/my/data/yyyy=2020/mm=1/dd=1/20200101.avro` and
`/my/data/yyyy=2020/mm=1/dd=2/20200102.avro`. With `exclude.sequence.id` set,
you will see two segments named `myTable_20200101_2020101` and
`myTable_20200102_2020102`  
**@fx19880617:** no  
**@fx19880617:** it doesn’t affect the query path  
**@fx19880617:** Pinot uses the segment name for data replacement  
**@zxcware:** Got it. If I'm going to replace 3 segments with 1 big one (to
compact small segments into big one), is it possible to do this seamlessly?  
**@fx19880617:** which means that if you generate segment names with
`exclude.sequence.id=false`, in the above example you will see the segment
name `myTable_20200102_2020102_1`. If you then replay segment creation for
2020-01-02 only, it will generate the segment name
`myTable_20200102_2020102_0`  
**@fx19880617:** which won’t replace the old segment  
**@fx19880617:** hmm, it’s a transactional segment replacement. I don’t see a
way to do it seamlessly right now. @snlee is adding support for group segment
replacement, how is it going?  
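
The naming behaviour described in this thread can be sketched in a few lines of Python. This is only an illustration, not Pinot's segment name generator: `segment_name`, `push`, and the dictionary standing in for the segment store are hypothetical, and the `yyyyMMdd` start/end values are assumed for readability. It shows why a replayed day replaces the old segment when the sequence id is excluded, and why it may not when the id is part of the name.

```python
def segment_name(table, start, end, sequence_id=None):
    """Join the name parts with underscores; omit the sequence id when it is None."""
    parts = [table, start, end]
    if sequence_id is not None:
        parts.append(str(sequence_id))
    return "_".join(parts)


def push(store, name, payload):
    """Upload a segment; an existing segment with the same name is replaced."""
    replaced = name in store
    store[name] = payload
    return replaced


# Bootstrap two daily files with sequence ids in the names (ids 0 and 1).
with_ids = {}
for seq, day in enumerate(["20200101", "20200102"]):
    push(with_ids, segment_name("myTable", day, day, seq), "v1")

# Replaying only 2020-01-02 restarts the sequence at 0, so the name differs
# from the original "_1" segment and nothing is replaced.
rerun_replaced = push(
    with_ids, segment_name("myTable", "20200102", "20200102", 0), "v2"
)

# Same flow with the sequence id excluded: names are stable across re-runs.
without_ids = {}
for day in ["20200101", "20200102"]:
    push(without_ids, segment_name("myTable", day, day), "v1")
stable_replaced = push(
    without_ids, segment_name("myTable", "20200102", "20200102"), "v2"
)

print(sorted(with_ids))
print(rerun_replaced, stable_replaced)
```

With sequence ids the store ends up holding both the old `_1` segment and the new `_0` segment for the same day, which is the duplicate-data situation described above.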
 **@zxcware:** Hi team, does this config
`controller.offline.segment.interval.checker.frequencyInSeconds` control when
added/updated offline segments are actually used? Is there a cron schedule to
control when new segments take effect?  
 **@zxcware:** I see. There is an explicit reload command  
**@npawar:** Any segments you add or update should be used immediately. You
don't need to trigger a reload for them.  
**@mayanks:** Also, the config that you mentioned above controls how often
the segment interval checker runs: ```Manages the segment validation
metrics, to ensure that all offline segments are contiguous (no missing
segments) and that the offline push delay isn't too high.```  
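
To make the "contiguous (no missing segments)" idea concrete, here is a minimal Python sketch of a gap check over segment time intervals. It is not the controller's implementation: `find_gaps` is a hypothetical helper, the intervals are made-up day numbers, and it assumes that intervals which touch or overlap count as contiguous.

```python
def find_gaps(intervals):
    """Return (gap_start, gap_end) pairs between consecutive segment intervals.

    intervals: iterable of (start, end) pairs with start <= end, all in the
    same time unit. Touching or overlapping intervals are treated as contiguous.
    """
    gaps = []
    covered_until = None
    for start, end in sorted(intervals):
        if covered_until is not None and start > covered_until:
            gaps.append((covered_until, start))
        covered_until = end if covered_until is None else max(covered_until, end)
    return gaps


# Hypothetical daily segments: days 1-2, 2-3, and 5-6. Days 3-5 are missing.
segments = [(5, 6), (1, 2), (2, 3)]
print(find_gaps(segments))
```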
 **@sandeep:** @sandeep has joined the channel  
 **@amitchopra:** Hi, I have a question about broker / server pruning. I have
2 servers and 4 segments. The mapping is:  
• server-0  
1\. metrics_OFFLINE_26835599_26835666_3  
2\. metrics_OFFLINE_26835733_26835799_2  
• server-1  
1\. metrics_OFFLINE_26835799_26835866_0  
2\. metrics_OFFLINE_26835666_26835733_1  
When I run a query like `select device,
count(device) as aggreg from metrics where eventTime > 26835599 and eventTime
< 26835626 group by device order by aggreg desc limit 10` I see:  
• *numServersQueried = 2*  
• *numServersResponded = 2*  
• *numSegmentsQueried = 4*  
• *numSegmentsProcessed = 1*  
• *numSegmentsMatched = 1*  
Questions:  
1\. Given the above query, the `eventTime` falls within the time range of a
single segment, `metrics_OFFLINE_26835599_26835666_3`. So I was expecting
*numServersQueried* to be 1 (instead of 2). Do I need to set something up for
broker pruning to take effect?  
2\. Similarly, I was expecting *numSegmentsQueried* to be 1 (instead of 4).  
3\. I always see the same value for *numSegmentsProcessed* and
*numSegmentsMatched*. What is the difference between the two? I looked at ,
but it wasn’t super clear to me from reading there.  
**@steotia:**  
• numSegmentsQueried is the number of segments the broker decided to query  
• numSegmentsProcessed is the number of segments the server decided to
process after all the pruning (if any)  
• numSegmentsMatched is the number of segments where at least 1 matching row
for the query was found on the servers. In your case, that happens to be all
processed segments  
• To reduce the number of segments queried, pruning can be used. The broker
can prune on the basis of the partition column if your table is partitioned
and the partitioning key is used in the query with an = predicate. The server
can prune on the basis of the time column filter. The server can also prune
using a bloom filter if one is created on the column you are using in the
query with an = filter  
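
The counters in this exchange can be reproduced with a small Python sketch of time-based pruning over the four segments listed in the question. It is an illustration only: Pinot prunes using segment metadata rather than names, so parsing the start and end time out of the segment name, treating both as inclusive, and the helper names `time_range` and `overlaps` are all assumptions made for this example.

```python
SEGMENTS_BY_SERVER = {
    "server-0": [
        "metrics_OFFLINE_26835599_26835666_3",
        "metrics_OFFLINE_26835733_26835799_2",
    ],
    "server-1": [
        "metrics_OFFLINE_26835799_26835866_0",
        "metrics_OFFLINE_26835666_26835733_1",
    ],
}


def time_range(segment_name):
    """Parse (start, end) from a name shaped like table_TYPE_start_end_seq."""
    parts = segment_name.split("_")
    return int(parts[-3]), int(parts[-2])


def overlaps(segment_name, lower_exclusive, upper_exclusive):
    """True if the segment may hold rows with lower < eventTime < upper."""
    start, end = time_range(segment_name)
    return end > lower_exclusive and start < upper_exclusive


# Filter from the query: eventTime > 26835599 and eventTime < 26835626.
LOWER, UPPER = 26835599, 26835626

# Without broker-side pruning, every segment is routed to its server.
num_segments_queried = sum(len(s) for s in SEGMENTS_BY_SERVER.values())

# Each server then drops the segments whose time range cannot match.
kept = {
    server: [s for s in segments if overlaps(s, LOWER, UPPER)]
    for server, segments in SEGMENTS_BY_SERVER.items()
}
num_segments_processed = sum(len(s) for s in kept.values())

# Servers that would still be needed if the broker applied the same filter.
servers_needed = [server for server, segments in kept.items() if segments]

print(num_segments_queried, num_segments_processed, servers_needed)
```

Under these assumptions only `metrics_OFFLINE_26835599_26835666_3` survives, which matches the reported *numSegmentsQueried = 4* and *numSegmentsProcessed = 1*. It also shows that a broker applying the same time filter would need to contact only server-0.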
**@amitchopra:** @steotia Then does *numSegmentsQueried* imply the total
number of segments?  
**@amitchopra:** And can the broker not apply pruning based on the time
column? Can only the server apply that pruning?  
**@steotia:** numSegmentsQueried is the number of segments the broker decided
to query. If there is no partitioning, this will be equal to the number of
segments in the table. Broker-side time column based pruning does exist;
support was added recently. I'm not sure whether it is already out in the
latest release, or whether there is more work remaining here.  
**@steotia:** @jiapengtao0 may know if broker side time column pruning is
available in the release and how to enable it  
**@amitchopra:** Thanks @steotia. So based on above, looks like server pruning
is happening, but broker pruning is not kicking in. Will wait for response
from @jiapengtao0 on how to get broker pruning to work. BTW i am running 0.6
version as of now  
**@steotia:** @jiatao ^^  
**@jiatao:** The feature was merged recently; it seems release 0.6.0 does not
include it.  
**@amitchopra:** @jiapengtao0 do you know when next version will be released?  
 **@ken:** Hi @amitchopra - I think you want to check out partitioning on , as
a way of avoiding sending the query to all servers (with broker-side pruning).  
**@amitchopra:** Thanks @ken. I did look at this, though I felt it would only
be required if the data is partitioned by some dimension field. Is it also
required for the time dimension field?  
**@ken:** You’re right that the time dimension is special, but (sadly) I don’t
know whether that changes how you’d configure things to prune via
partitioning.  
**@mayanks:**
```
if (RoutingConfig.TIME_SEGMENT_PRUNER_TYPE.equalsIgnoreCase(segmentPrunerType)) {
  TimeSegmentPruner timeSegmentPruner = getTimeSegmentPruner(tableConfig, propertyStore);
  if (timeSegmentPruner != null) {
    segmentPruners.add(timeSegmentPruner);
  }
}
```  
**@mayanks:** Based on a quick check of the code, we do have time-based
segment pruning at the broker level ^^  
**@amitchopra:** @mayanks Is the broker level pruning based on time enabled by
default? Or do i need to set the routing config accordingly?  
**@mayanks:** The code is looking for RoutingConfig.  
**@jiatao:** Hi @amitchopra, the broker time pruner is not enabled by
default. To enable it, you need to add a routing config like the following:
```
"routing": {
  "segmentPrunerTypes": ["Time"]
}
```  
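
As a rough sketch of applying that routing fragment to an existing table config, the Python below adds `"Time"` to `routing.segmentPrunerTypes` without touching other keys. The helper `enable_time_pruner` and the sample config values are hypothetical, and how the config is fetched from and pushed back to the controller is left out. The case-insensitive check mirrors the `equalsIgnoreCase` comparison in the code quoted above.

```python
import copy
import json


def enable_time_pruner(table_config):
    """Return a copy of table_config with "Time" in routing.segmentPrunerTypes."""
    updated = copy.deepcopy(table_config)
    routing = updated.setdefault("routing", {})
    pruners = routing.setdefault("segmentPrunerTypes", [])
    # Avoid adding a duplicate if a time pruner is already listed in any casing.
    if not any(p.lower() == "time" for p in pruners):
        pruners.append("Time")
    return updated


# Placeholder table config; a real one carries many more sections.
config = {"tableName": "metrics", "tableType": "OFFLINE"}
updated_config = enable_time_pruner(config)
print(json.dumps(updated_config, indent=2))
```

As noted in the thread, the pruner is not part of release 0.6.0, so this setting only helps on a build that includes the feature.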
**@amitchopra:** Thanks. Let me try this out. Though one question, is this
supported in version 0.6?  
**@jiatao:** It was merged recently; it seems it's not in 0.6.  
**@mayanks:** @jiatao could we update the docs as well?  
**@jiatao:** Sure, I'll update it.  
**@jiatao:** @mayanks Any idea when we'll cut next release?  
**@mayanks:** Oh this is not part of the 0.6.0 release, perhaps we should
update the doc with the next release then.  
**@amitchopra:** @mayanks asking my last question again: do you know when the
next version will be released? So that we could take advantage of this  
**@mayanks:** There isn't a concrete plan that I am aware of, however, we have
been trying to get one every couple of months (last one was Nov/Dec). We can
bring this up in the dev channel.  
**@amitchopra:** ok, thanks Mayank  

###  _#random_

  
 **@sandeep:** @sandeep has joined the channel  

###  _#pql-2-calcite_

  
 **@humengyuk18:** @humengyuk18 has joined the channel  

###  _#troubleshooting_

  
 **@sandeep:** @sandeep has joined the channel  