Posted to dev@pinot.apache.org by Pinot Slack Email Digest <sn...@apache.org> on 2021/05/15 02:00:17 UTC

Apache Pinot Daily Email Digest (2021-05-14)

### _#general_

  
 **@mbracke:** @mbracke has joined the channel  
 **@brijdesai6:** @brijdesai6 has joined the channel  
 **@laurachen:** @laurachen has joined the channel  
 **@aaron:** I got some data ingested and am using a star tree index and I'm
running a query like `select foo, percentiletdigest(bar, 0.5) from mytable
group by foo` . I've got `foo` in my `dimensionsSplitOrder` and I've got
`PERCENTILE_TDIGEST__bar` as well as `AVG__bar` in my `functionColumnPairs` .
My query takes about 700 ms but if I switch it to `avg(bar)` it takes 15 ms.
Is it expected that the t-digest would be that much slower? Anything I can do
to speed it up?  
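(For reference, the setup described above corresponds to a star-tree config roughly like the following sketch inside the table's `tableIndexConfig`. `foo`, `bar`, and the function-column pairs are taken from the message; `dimA`/`dimB` are hypothetical placeholders for the other dimensions in the split order.)

```json
{
  "tableIndexConfig": {
    "starTreeIndexConfigs": [
      {
        "dimensionsSplitOrder": ["dimA", "dimB", "foo"],
        "skipStarNodeCreationForDimensions": [],
        "functionColumnPairs": ["PERCENTILE_TDIGEST__bar", "AVG__bar"]
      }
    ]
  }
}
```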
**@fx19880617:** @jackie.jxt does pinot support percentile tdigest in
startree?  
**@fx19880617:** in the response stats, do you see the same number of docs scanned
for both queries?  
**@jackie.jxt:** Yes, startree supports TDigest. See  for more details  
**@jackie.jxt:** Is the query constantly taking 700ms?  
**@aaron:** For avg and percentiletdigest, numDocsScanned is 969792.  
**@aaron:** Yeah, consistently in that range. It just took 1057 ms when I ran
it  
**@mayanks:** Yeah tdigest aggregation over 1M docs might take that long  
**@aaron:** What does `numDocsScanned` mean in the context of a star tree
index?  
**@mayanks:** Do you have query latency with just tdigest?  
**@aaron:** What do you mean?  
**@mayanks:** Query with percentile tdigest but without avg  
**@aaron:** Oh sorry, that's what I meant  
**@mayanks:** Oh ok  
**@mayanks:** Docs scanned should mean the same  
**@aaron:** `select foo, percentiletdigest(bar, 0.5) from mytable group by
foo` is slow, `select foo, avg(bar) from mytable group by foo` is fast  
**@mayanks:** Split order helps with filtering  
**@mayanks:** @jackie.jxt does it help with group by or just filtering?  
**@aaron:** If I have 969792 numDocsScanned and 8950109972 totalDocs, what
does numDocsScanned mean? Is that the number of star tree nodes or something?  
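(The stats in question come back in the broker's response metadata; a trimmed sketch using the numbers quoted in this thread -- the field names are standard Pinot response metadata, and the latency is the ~700 ms figure mentioned above:)

```json
{
  "numDocsScanned": 969792,
  "totalDocs": 8950109972,
  "numSegmentsQueried": 462,
  "timeUsedMs": 700
}
```

(With a star-tree, the documents scanned include pre-aggregated star-tree records rather than only raw rows, which is presumably why `numDocsScanned` can sit so far below `totalDocs`.)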
**@jackie.jxt:** @mayanks Most time just filtering  
**@jackie.jxt:** @aaron Do you need 0.5 percentile or 50 percentile? The
aggregation cost of `percentiletdigest` is expected to be much higher than
`avg`  
**@aaron:** Eh I don't actually care about which percentile just yet -- just
the performance  
**@aaron:** Is there anything I can do to speed it up? A lot of my users here
prefer quantiles, I think performance there will really matter  
**@aaron:** The avg performance is... awesome  
**@mayanks:** Your query does not have filters  
**@mayanks:** Will it be the case always?  
**@aaron:** Could be  
**@aaron:** Right now I only have a small subset of the data, but yeah people
might be filtering by date at the very least  
**@aaron:** Do you expect filters to help a lot?  
**@mayanks:** It will cut down numDocsScanned right  
**@aaron:** Right  
**@aaron:** I'd expect people to be scanning a similar number of documents if
not an order of magnitude more  
**@mayanks:** @jackie.jxt Any ideas on using pre-aggregates within the star tree
here?  
**@mayanks:** Also, @aaron In production will you have the same cluster size as
right now? Because if you have more servers, you'll get better perf  
**@jackie.jxt:** If `foo` is the first dimension in the split order, then it
will always use the pre-aggregate doc  
**@jackie.jxt:** @aaron What's the cardinality of `foo`? How many segments do
you have right now?  
**@aaron:** Foo's cardinality is about 6  
**@aaron:** 462 segments  
**@aaron:** 5 servers  
**@aaron:** Foo is third in dimensionsSplitOrder, there are 7 fields total in
there  
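(Per Jackie's earlier point, moving `foo` to the front of the split order would let this group-by always use the pre-aggregated docs; a sketch of that reordering, again with hypothetical placeholder names for the other dimensions:)

```json
{
  "dimensionsSplitOrder": ["foo", "dimA", "dimB"]
}
```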
**@jackie.jxt:** In that case, in order to further optimize the performance,
you may reduce the `maxLeafRecords` threshold, though this will increase the
size of the star-tree  
**@mayanks:** Just to call out, a lot of the latency inherently comes from the
TDigest library.  
**@mayanks:** It is pretty good in providing accuracy in limited storage, but
there's a latency cost.  
**@aaron:** Is q-digest any better? My understanding was that t-digest is
faster and more accurate  
**@aaron:** Do you have any approximate guidelines around how much faster
performance will be and how much more space the star tree will take up as
maxLeafRecords is decreased?  
**@mayanks:** Yes, t-digest is definitely better than others. But it may not
give you 10ms latency if you are aggregating 1M records.  
**@aaron:** How can I get to, say, 200ms?  
**@mayanks:** Tuning star tree (Jackie?), index size, server cores/jvm/params,
etc  
**@jackie.jxt:** For star-tree, you can trade performance with extra space by
reducing the `maxLeafRecords`  
**@jackie.jxt:** Reducing that to 1 will give you fully pre-cubed data  
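(In config terms, that is the `maxLeafRecords` field on the same star-tree entry; a sketch of the fully pre-cubed extreme Jackie describes, trading index size for latency, with placeholder dimension names as before:)

```json
{
  "starTreeIndexConfigs": [
    {
      "dimensionsSplitOrder": ["foo", "dimA", "dimB"],
      "functionColumnPairs": ["PERCENTILE_TDIGEST__bar", "AVG__bar"],
      "maxLeafRecords": 1
    }
  ]
}
```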
 **@benjamin.walker:** @benjamin.walker has joined the channel  
 **@aritra55:** @aritra55 has joined the channel  
 **@oneandwholly:** @oneandwholly has joined the channel  

###  _#random_

  
 **@mbracke:** @mbracke has joined the channel  
 **@brijdesai6:** @brijdesai6 has joined the channel  
 **@laurachen:** @laurachen has joined the channel  
 **@benjamin.walker:** @benjamin.walker has joined the channel  
 **@aritra55:** @aritra55 has joined the channel  
 **@oneandwholly:** @oneandwholly has joined the channel  

###  _#troubleshooting_

  
**@jmeyer:** Hello! :wave: *I've got the following scenario:*
• Data is integrated in multiple batches per day (in an OFFLINE table)
  ◦ *Batch 1:* _01/01/2021 (data date) - DATA 1, DATA 3, DATA 6 -> `Segment_1(date=01/01/2021, data=[DATA 1, DATA 3, DATA 6])`_
  ◦ *Batch 2:* _01/01/2021 (data date) - DATA 2, DATA 4, DATA 5 -> `Segment_2(date=01/01/2021, data=[DATA 2, DATA 4, DATA 5])`_
• Data must be available asap, so 2 separate segments are generated & ingested into Pinot
• Some data needs to be corrected after the initial data ingestion, say DATA 1 & DATA 2
I know it is possible to replace segments, but how can we handle replacing data across multiple segments? Can we generate a new segment with only the modified data and ignore the old data in the previous segments (`Segment_1` & `Segment_2`) -> `Segment_3(date=01/01/2021, data=[DATA 1, DATA 2])`? Or do we have to regenerate the 2 segments entirely? (If so, we need to identify what they contain - possibly after merging them?)  
**@mayanks:** You can regenerate the two segments (using the same names as the
existing segments) and push them to Pinot. Currently this is not an atomic
transaction, so there may be a small time period when one segment is old and
another is new. A fix for this is being worked on. @snlee  
**@jmeyer:** Thanks @mayanks. So it is necessary to know the contents of the 2
segments and regenerate them with the same data as before (+ updates)? Sounds
like this could be non-trivial in some cases  
**@jmeyer:** What is Pinot's behavior when duplicated data exists? E.g. a 3rd
segment with some data already present in the first 2. The notion of
"duplicated" implies we have a primary key, which is not the case for an OFFLINE
table iirc, so I guess we would simply have "duplicated" rows  
**@mayanks:** Pinot won’t know that it is duplicate data, and it will be
included in query processing  
**@mayanks:** If you are generating daily segments then replacing one day's
segments should be straightforward  
**@jmeyer:** > If you are generating daily segments then replacing one day's segments should be straightforward
The difficulty is that not all of a day's data may/will arrive at the same time. Hence:
• *Batch 1:* _01/01/2021 (data date) - DATA 1, DATA 3, DATA 6 -> `Segment_1(date=01/01/2021, data=[DATA 1, DATA 3, DATA 6])`_
• *Batch 2:* _01/01/2021 (data date) - DATA 2, DATA 4, DATA 5 -> `Segment_2(date=01/01/2021, data=[DATA 2, DATA 4, DATA 5])`_
In the end, I feel like my question is "how can we update part of a segment?", and I feel like that's not possible. It looks like there are only 2 ways to reach my goal then:
1. Only have a single segment per day at a time, so:
  a. Drop the day's segment
  b. Regenerate the segment with updated data [99% of the data may not have changed, so pretty inefficient]
2. Identify the impacted segments & regenerate only those (in their entirety)
What do you think? :slightly_smiling_face:  
**@mayanks:** Is your offline pipeline not generating daily partitions?
Typically offline pipelines would create time-partitioned folders, and a
segment generated from one folder is guaranteed not to overlap with other
days  
**@jmeyer:** It is, but we have 3 additional constraints:
• Data for a given day can arrive in multiple parts (for the same day) [imagine the case with N timezones]
• Partial data needs to be available asap (we can't wait for the other parts)
• Need to be able to update some data later on (doesn't need to be perfectly efficient, as it's clearly not an ideal case for OLAP)  
**@mayanks:** Do you not have a realtime component? If you do, then you can
serve data from realtime while your offline data settles  
**@jmeyer:** I feel like this would help, but no, data comes in batches from
external sources... I'll keep that in mind still  
**@mayanks:** What is the max delay for data to arrive? Does one day's worth
of data settle in a day or so? Or it can take several days / weeks?  
**@mayanks:** Also, even if your incoming data is not partitioned, you can
always generate segments to guarantee the data belongs to one day (e.g. pick
several folders to scan and select data only for a single day to generate the
input for the Pinot segment)  
**@jmeyer:** > What is the max delay for data to arrive? Does one day's worth
of data settle in a day or so? Or it can take several days / weeks? Typically
much less than a month but it is technically unbounded - customer data could
theoretically be corrected months after first ingestion  
**@mayanks:** When it arrives after a month, which folder does it land in? Is
it in the correct date folder? Also, how do you know which older folders got
changed?  
**@mayanks:** Throwing out an idea: if you can find the delta between what was
pushed to Pinot as part of the daily push and the corrections so far across all
days, you can have one set of segments for the daily data, and another set
(perhaps very small, say 1 or 2 segments) that represents the delta across all
days, and keep refreshing those delta segments  
**@mayanks:** It works if your delta is tiny, but may not scale if the delta is
huge  
**@jmeyer:** > Also, even if your incoming data is not partitioned, you can
always generate segments to guarantee the data belongs to one day (e.g. pick
several folders to scan and select data only for a single day to generate the
input for the Pinot segment)
If I understand correctly, you're saying that after every batch, we regenerate
the whole Pinot segment? For example, we've got a single file per batch, and
after every batch we could regenerate a single Pinot segment from all of these
files. Meaning we always keep a single Pinot segment (per day) at a time, and
replacing it is straightforward  
**@mayanks:** Discussed offline, @jmeyer to summarize.  
**@jmeyer:** Yes :slightly_smiling_face:  
**@jmeyer:** *Summary:*
_*Context:*_
• Data comes in batches every day (for the same day)
• Each batch generates a new file
• Data must be available asap (i.e. we can't wait to have all the data before generating a segment)
• Data corrections can come in later (weeks)
_*Solution discussed:*_
• While every batch generates a new separate file, the goal is to keep a single Pinot segment per day at any given time
• To do so, after every batch:
  ◦ Merge every file for the day before calling CreateSegment, generating a new segment containing all (existing) data for the day
    ▪︎ Later, a new feature will allow generating a single Pinot segment out of multiple input files, dropping the need for file concatenation
  ◦ This new segment replaces the existing one (for the day)  
**@jmeyer:** This solution means that we only need to regenerate a single
segment per day impacted by a data correction. However, if a data correction
happens along a dimension other than time, say we have (date, entity,
value) and we correct all values for a given entity, it will result in the
regeneration of *all* segments  
**@jmeyer:** @mayanks Summary sounds ok?  
**@mayanks:** Yes, thanks  
 **@mbracke:** @mbracke has joined the channel  
 **@brijdesai6:** @brijdesai6 has joined the channel  
 **@laurachen:** @laurachen has joined the channel  
 **@mayanks:** @jlli Do we have a doc to describe the preprocessing for
partition/sort before ingestion? If so could you share? If not, could we add
the doc? cc: @syedakram93  
**@jlli:** Hey @syedakram93, yes, we do have a design doc on the preprocessing
job, though it’s still in a LinkedIn-internal directory. Let me put it on the
wiki page. In the meantime, you can refer to this file to see how it’s getting
used:  
**@mayanks:** @jlli If we can add it to , that would be great  
**@jlli:** Yeah, that’s where I’m going to add it  
**@mayanks:** thanks  
 **@benjamin.walker:** @benjamin.walker has joined the channel  
 **@aritra55:** @aritra55 has joined the channel  
 **@oneandwholly:** @oneandwholly has joined the channel  
 **@ken:** I’ve been fooling around with how Pinot handles the “URI push” of
segments. It seems like if I’m not using HDFS for deep storage, then the
controller will download the segments before pushing to the server, which
seems like it’s not a win. Is that correct? And (so far) I haven’t been able
to configure the controller to successfully handle an HDFS URI push request,
at least when I’m not using HDFS for deep storage - I see the msg when the
controller starts up that the “hdfs file system” was initialized, but when it
gets the URI push request, it fails with an error about the hdfs file system
not being initialized. Any ideas?  
**@mayanks:** URI push should work for all deep storages that provide URI-based
access (HDFS/ADLS/GCP/S3); the only exception is NFS, I'd think  
**@mayanks:** Unsure about why you are seeing that behavior, would need more
debugging.  
**@ken:** I was trying to figure out if you could do an HDFS URI push without
enabling deep storage for the same. So instead of pushing actual segments
through the controller to be stored locally by server processes, you’d push
the URI and the server process would download locally. Sounds like that’s not
supported.  
**@g.kishore:** you need to use URI with metadata push  

###  _#pinot-dev_

  
 **@mayanks:** @snlee, did we get any consensus on the 0.8.0 release timeline?  
**@snlee:** Here is the changelog since the last release, from the master
branch. I’m trying to come up with the list of new major features. If anyone
has a feature to highlight, please add it to this thread.  