Posted to dev@pinot.apache.org by Pinot Slack Email Digest <ap...@gmail.com> on 2021/10/05 02:00:19 UTC

Apache Pinot Daily Email Digest (2021-10-04)

### _#general_

  
 **@adireddijagadesh:** @adireddijagadesh has joined the channel  
 **@lrhadoop143:** @lrhadoop143 has joined the channel  
 **@tahir.fayyaz:** @tahir.fayyaz has joined the channel  
 **@karinwolok1:** :wave: Heyyyy to all the new :wine_glass: members! :smile:
Please tell us who you are and what brought you here! @adireddijagadesh
@lrhadoop143 @tahir.fayyaz @iamluckysharma.0910 @nanda.yugandhar
@son.nguyen.nam @tgauchoux @senthissenthh @alinoorrahman @ssainz @cebofil371
@arunsundar4298 @ssandeepadas007 @barana @rjpatrick @meetdesai74 @moeryomenko
@kskarthik.columbia @rajesh.narayan @gsisodiya @nageshblore @rahul.oracle.db
@yifanzhao0210 @nakkul @pradeepks2003 @salkadam @arun11299 @rupesh_raghavan
@i.young @zainamro1 @yuchaoran2011 @cechovsky.jozef @bvarunpy @becca.silverman
@leb9882 @sanipindi @raj.swarnim @lxy1995seu @syed.hadi @ebuth @sirsh @mattk
@zineb.raiiss @frank_gao2 @suresh.intuit @weixiang.sun @chengweili402 @alec
@pascal.jerome  
**@sirsh:** Hi all!! I am (slowly) evaluating Pinot and hope to spend much
more time with it in the coming weeks. I came across Pinot while looking for
an alternative to Druid that might be easier to maintain but offer some of the
same types of OLAP over either Kafka data or data stored on an S3 data lake
(via Presto). I am looking for a tool that augments what Snowflake provides
(in the space of ad hoc, higher-latency queries). I hope to sell my
organization on the power of going beyond the "snowflake model" to understand
assembly "flows" in manufacturing. K8s is important to me, so a big part of my
evaluation will be how I make this work within my infrastructure (deployed via
Argo-CD along with all our other great open source apps). Looking forward to
learning with you all!  
**@adireddijagadesh:** Hi All, I am here to contribute and raise PRs.
Previously contributed to Kafka and Druid.  
**@alec:** My firm is in the very early stages of exploring Pinot and I am
just here to be able to search around for FAQs I have and keep an eye on the
roadmap. I work with Tiger Zhao.  
**@sanipindi:** Hello, I am here to explore Pinot in our organization for
user-facing analytics on AWS cloud infra. Hugely interested in the operability
and maintainability aspects of it in production workloads.  
 **@dadelcas:** Hi, I'm experiencing a strange issue today. My realtime table
has a bucketTimePeriod of 1d and a bufferTimePeriod of 2d. My offline workflows
are not running and I can see a message in the logs that says "Window data
overflows into CONSUMING segments for partition of segments..." followed by
"Found no eligible segments for task: RealtimeToOfflineSegmentsTask with window
[1555200000 - 1641600000]. Skipping task generation...". Note these timestamps
seem to be in seconds instead of milliseconds. I can see the segment.end.time
and segment.start.time values are in seconds, and I'm not sure whether this was
the case before. Looking through the code I can see TimeUtils computes the
window using milliseconds, so this is why the window spans 2 years instead of
2 days. I'm trying to figure out why this is happening now, any help is
appreciated  
**@mayanks:** Do you mind pasting your table and task configs? I am guessing
some issue with time unit. cc: @npawar  
**@dadelcas:** It seems the schema was defined in microseconds and the table
in milliseconds. Is there an easy way to fix this issue?  
**@npawar:** we only read what is set in schema. the “timeColumnUnit” in table
config is ignored (also deprecated). So schema was MICROSECONDS and data had
values in millis?  
**@npawar:** it would be cleanest to just start the realtime table from
scratch, because any segments that have completed so far will have incorrect
start/end times  
**@dadelcas:** Yup, the data contains milliseconds  
**@dadelcas:** I wouldn't like to lose the test data I've got because I still
don't have a proper backfilling process in place. I could write a script to
update the znodes, but that's effort I'd rather spend on backfilling  
**@dadelcas:** Thanks both  
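For reference, a minimal sketch of a schema `dateTimeFieldSpec` whose declared unit matches data that actually arrives in epoch milliseconds (the column name is hypothetical; as noted above, only the unit declared in the schema is honoured):

```json
{
  "dateTimeFieldSpecs": [
    {
      "name": "eventTime",
      "dataType": "LONG",
      "format": "1:MILLISECONDS:EPOCH",
      "granularity": "1:MILLISECONDS"
    }
  ]
}
```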
 **@camerronadams:** @camerronadams has joined the channel  

###  _#random_

  
 **@adireddijagadesh:** @adireddijagadesh has joined the channel  
 **@lrhadoop143:** @lrhadoop143 has joined the channel  
 **@tahir.fayyaz:** @tahir.fayyaz has joined the channel  
 **@camerronadams:** @camerronadams has joined the channel  

###  _#troubleshooting_

  
 **@msoni6226:** Hi All, I am running a hybrid table setup. If I delete a
table for some reason and recreate it with the same name, does the Minion need
to be restarted for the realtime-to-offline flow to work for the newly created
table?  
**@mayanks:** As long as the new table config has the realtime to offline
configs in it, it will work.  
**@msoni6226:** Yes, the REALTIME table has the realtime-to-offline
configuration defined. So, if we re-create the table with the same name, a
Minion restart is not required, right?  
**@mayanks:** No restart needed; minions scan all table configs to identify jobs to schedule.  
**@msoni6226:** Thanks Mayank for confirming it  
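For reference, a rough sketch of the task block a realtime table config needs so the Minion picks the table up on its next scheduling pass (the period values shown are hypothetical):

```json
{
  "task": {
    "taskTypeConfigsMap": {
      "RealtimeToOfflineSegmentsTask": {
        "bucketTimePeriod": "1d",
        "bufferTimePeriod": "2d"
      }
    }
  }
}
```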
 **@vibhor.jain:** Hi Team, we are planning to add a de-duplication flow for
our solution. As part of that, we are doing 2 things:
1. Enable UPSERT in the realtime table.
   - Flink has the key set.
   - Primary key defined in the schema and the realtime table uses it.
   - UPSERT working fine.
2. For the realtime-to-offline flow via minion, we found that duplicates were
coming into the OFFLINE table.
   - So we tried mergeType: dedup (earlier it was concat).
Now the realtime-to-offline flow has stopped working (no data in the OFFLINE
table, minion is up and running). Queries:
1. Is our dedup flow proper? UPSERT for realtime and mergeType: dedup for the
realtime-to-offline flow?
2. Any pointers on why this realtime-to-offline flow stopped working after
adding these configs?  
**@mayanks:** Upsert feature currently works for realtime only tables, afaik  
**@mayanks:** cc @jackie.jxt @yupeng  
**@vibhor.jain:** Hi @mayanks, we are using UPSERT (i.e. "primaryKeyColumns":
[] defined in the schema) for the realtime table and it's working fine. Are we
saying the hybrid table flow is not supported here via mergeType: dedup?  
**@mayanks:** Hi @vibhor.jain I am saying UPSERT does not support hybrid table
at the moment, it is being worked on afaik.  
**@vibhor.jain:** mergeType: "dedup" should work there? Although we are doing
a full row comparison, that should be ok for deduplication?  
**@mayanks:** However, if let's say an upsert comes for the same primary key,
I don't think the code will take care of what is already moved to offline,
right?  
**@vibhor.jain:** Hi @mayanks, we are not looking for UPSERT across the
realtime and offline tables. UPSERT is limited to the realtime table and it's
working as expected for us. Now, when we move this realtime data to offline,
the duplicate records show up again (UPSERT keeps all copies). So we tried the
mergeType config with value "dedup" to handle duplicates here, and we see the
realtime-to-offline flow has suddenly stopped working.  
**@mayanks:** Ok, maybe check the dedup config, and also see if there are
errors in the minion log  
**@npawar:** you might also find logs about realtimeToOffline job in
controller. First some lines should appear in controller about task
scheduling, then you will see corresponding lines in minion logs about task
execution  
**@jackie.jxt:** Realtime to offline flow will move records to the offline
table (managed separately from the realtime table), and UPSERT cannot be
applied to the offline side. Dedup won't help with this setup  
**@jackie.jxt:** Please check the log on controller about the
realtimeToOffline task as Neha suggested. And also check if there are time
gaps. IIRC there is a known issue of realtimeToOffline task stuck when there
are time gaps in the data  
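For context, a rough sketch of the combination described in this thread: upsert on the realtime table plus mergeType: dedup on the minion task (values are hypothetical, and, as noted above, upsert semantics do not carry over to the OFFLINE side):

```json
{
  "upsertConfig": { "mode": "FULL" },
  "task": {
    "taskTypeConfigsMap": {
      "RealtimeToOfflineSegmentsTask": {
        "bucketTimePeriod": "1d",
        "mergeType": "dedup"
      }
    }
  }
}
```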
 **@adireddijagadesh:** @adireddijagadesh has joined the channel  
 **@lrhadoop143:** @lrhadoop143 has joined the channel  
 **@valentin:** Hello, I’m looking to enable segment replication on my cluster
and I found 2 ways to do it in the documentation: •  with the `replication` key
•  with replica groups Which is the better way to do it? Maybe there is
something I misunderstood? Thank you  
**@npawar:** You can start off with the first way. Replica groups are an
advanced optimization for when you want to reduce server fan-out and scale
throughput (as mentioned in that doc)  
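Roughly, the two options look like this in the table config (the counts and tenant tag below are hypothetical; the replica-group blocks are only needed for the advanced setup):

```json
{
  "segmentsConfig": { "replication": "2" },
  "routing": { "instanceSelectorType": "replicaGroup" },
  "instanceAssignmentConfigMap": {
    "OFFLINE": {
      "tagPoolConfig": { "tag": "DefaultTenant_OFFLINE" },
      "replicaGroupPartitionConfig": {
        "replicaGroupBased": true,
        "numReplicaGroups": 2,
        "numInstancesPerReplicaGroup": 3
      }
    }
  }
}
```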
 **@tahir.fayyaz:** @tahir.fayyaz has joined the channel  
 **@bajpai.arpita746462:** Hi All, I am trying to move data from a REALTIME
to an OFFLINE table through the minion flow but I am getting the below error:

```
Starting PinotTaskManager with running frequency of 1200 seconds.
Start running task: PinotTaskManager
Trying to schedule task type: RealtimeToOfflineSegmentsTask, isLeader: true
Start generating task configs for table: dataanalytics_REALTIME for task: RealtimeToOfflineSegmentsTask
Caught exception while running task: PinotTaskManager
java.lang.IllegalStateException: null
    at com.google.common.base.Preconditions.checkState(Preconditions.java:429) ~[pinot-all-0.8.0-SNAPSHOT-jar-with-dependencies.jar:0.8.0-SNAPSHOT-52272667e51acdf082b90766673dfcb77f6f9cc0]
    at org.apache.pinot.plugin.minion.tasks.realtime_to_offline_segments.RealtimeToOfflineSegmentsTaskGenerator.getWatermarkMs(RealtimeToOfflineSegmentsTaskGenerator.java:300) ~[pinot-all-0.8.0-SNAPSHOT-jar-with-dependencies.jar:0.8.0-SNAPSHOT-52272667e51acdf082b90766673dfcb77f6f9cc0]
    at org.apache.pinot.plugin.minion.tasks.realtime_to_offline_segments.RealtimeToOfflineSegmentsTaskGenerator.generateTasks(RealtimeToOfflineSegmentsTaskGenerator.java:161) ~[pinot-all-0.8.0-SNAPSHOT-jar-with-dependencies.jar:0.8.0-SNAPSHOT-52272667e51acdf082b90766673dfcb77f6f9cc0]
    at org.apache.pinot.controller.helix.core.minion.PinotTaskManager.scheduleTask(PinotTaskManager.java:405) ~[pinot-all-0.8.0-SNAPSHOT-jar-with-dependencies.jar:0.8.0-SNAPSHOT-52272667e51acdf082b90766673dfcb77f6f9cc0]
    at org.apache.pinot.controller.helix.core.minion.PinotTaskManager.scheduleTasks(PinotTaskManager.java:383) ~[pinot-all-0.8.0-SNAPSHOT-jar-with-dependencies.jar:0.8.0-SNAPSHOT-52272667e51acdf082b90766673dfcb77f6f9cc0]
    at org.apache.pinot.controller.helix.core.minion.PinotTaskManager.processTables(PinotTaskManager.java:477) ~[pinot-all-0.8.0-SNAPSHOT-jar-with-dependencies.jar:0.8.0-SNAPSHOT-52272667e51acdf082b90766673dfcb77f6f9cc0]
    at org.apache.pinot.controller.helix.core.periodictask.ControllerPeriodicTask.runTask(ControllerPeriodicTask.java:68) ~[pinot-all-0.8.0-SNAPSHOT-jar-with-dependencies.jar:0.8.0-SNAPSHOT-52272667e51acdf082b90766673dfcb77f6f9cc0]
    at org.apache.pinot.core.periodictask.BasePeriodicTask.run(BasePeriodicTask.java:120) ~[pinot-all-0.8.0-SNAPSHOT-jar-with-dependencies.jar:0.8.0-SNAPSHOT-52272667e51acdf082b90766673dfcb77f6f9cc0]
    at org.apache.pinot.core.periodictask.PeriodicTaskScheduler.lambda$start$0(PeriodicTaskScheduler.java:85) ~[pinot-all-0.8.0-SNAPSHOT-jar-with-dependencies.jar:0.8.0-SNAPSHOT-52272667e51acdf082b90766673dfcb77f6f9cc0]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305) [?:?]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305) [?:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
    at java.lang.Thread.run(Thread.java:834) [?:?]
Finish running task: PinotTaskManager in 126ms
Creating new stream consumer, reason: Idle for too long
```

Can anyone help me understand the above error?  
**@npawar:** can you check the segment metadata of the completed segments? it
looks like they have incorrect start/end times.  
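One way to look at this is the controller's segment metadata endpoint, which includes the segment.start.time / segment.end.time values mentioned earlier (the controller host, table, and segment names below are hypothetical):

```bash
# Inspect the start/end time recorded for one completed segment (names are examples)
curl "http://localhost:9000/segments/dataanalytics_REALTIME/dataanalytics__0__0__20211001T0000Z/metadata"
```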
 **@anu110195:** How do I query Pinot's broker with an increased timeout?  
**@dadelcas:** Try passing `timeoutMs` as part of the query options  
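For example, one way the option can be supplied is by appending it to the query sent to the broker (broker host/port, table name, and timeout value below are hypothetical):

```bash
# Ask the broker to allow up to 60s for this query
curl -X POST "http://localhost:8099/query/sql" \
  -H "Content-Type: application/json" \
  -d '{"sql": "SELECT COUNT(*) FROM myTable OPTION(timeoutMs=60000)"}'
```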
 **@andruszd:** When I use the Pinot API to check that the tables are all OK,
using ('ingestionState') == "UNHEALTHY", I get the following error messages:

```
Error Message for stream4_metric: Did not get any response from servers for segment: stream4_metric__0__22__20210930T1201Z
Segment Error for:-> stream4_metric__0__22__20210930T1201Z: {"code":404,"error":"[] not found."}

Error Message for stream5_events: Did not get any response from servers for segment: stream5_events__0__0__20210819T1031Z
Segment Error for:-> stream5_events__0__0__20210819T1031Z: {"code":404,"error":"[] not found."}
```

So when I go to the servers (x2), the stream4 file is totally missing and
stream5 is in a _tmp directory:

```
ls -l stream4_metric__0__22__20210930T1201Z
ls: cannot access stream4_metric__0__22__20210930T1201Z: No such file or directory
```

There are no files in _tmp for stream4, and the stream5 file is in the _tmp
directory multiple times and, as you can see, has data:

```
ls *stream5_events__0__0__20210819T1031Z*
tmp-stream5_events__0__0__20210819T1031Z-1629822465115:
tmp-c1054878-57f2-47f7-b7c2-085b5dfec02f
tmp-stream5_events__0__0__20210819T1031Z-1630594394292:
tmp-fdb518aa-929b-4eb3-9ecf-a4386ae9d084
tmp-stream5_events__0__0__20210819T1031Z-1630661411117:
tmp-5f8ed825-d323-4344-b433-5287a7a32d4e
tmp-stream5_events__0__0__20210819T1031Z-1632927825055:
tmp-0f94cfe9-f024-4deb-9f72-8435b5315ee5
tmp-stream5_events__0__0__20210819T1031Z-1633016079847:
tmp-df43885d-83c5-41fb-9eb1-0cfd1bfc6be9

/data/pinot/data/index/stream5_events_REALTIME/_tmp/tmp-stream5_events__0__0__20210819T1031Z-1629822465115/tmp-c1054878-57f2-47f7-b7c2-085b5dfec02f
tmp-c1054878-57f2-47f7-b7c2-085b5dfec02f]$ ls -l
total 24176
-rw-r--r--. 1 pinot pinot       20 Aug 24 17:27 device_id.dict
-rw-r--r--. 1 pinot pinot      541 Aug 24 17:27 .unsorted.fwd
-rw-r--r--. 1 pinot pinot 24638218 Aug 24 17:27 event.dict
drwxr-xr-x. 2 pinot pinot     4096 Aug 24 17:27 event.json.idx.tmp
-rw-r--r--. 1 pinot pinot     1982 Aug 24 17:27 .unsorted.fwd
-rw-r--r--. 1 pinot pinot     9384 Aug 24 17:27 event_timestamp.dict
-rw-r--r--. 1 pinot pinot     1982 Aug 24 17:27 .unsorted.fwd
-rw-r--r--. 1 pinot pinot        0 Aug 24 17:27 .raw.fwd
-rw-r--r--. 1 pinot pinot        2 Aug 24 17:27 is_private.dict
-rw-r--r--. 1 pinot pinot      181 Aug 24 17:27 .unsorted.fwd
-rw-r--r--. 1 pinot pinot        2 Aug 24 17:27 is_working_time.dict
-rw-r--r--. 1 pinot pinot      181 Aug 24 17:27 .unsorted.fwd
-rw-r--r--. 1 pinot pinot     1920 Aug 24 17:27 namespace.dict
-rw-r--r--. 1 pinot pinot     1081 Aug 24 17:27 .unsorted.fwd
-rw-r--r--. 1 pinot pinot    11528 Aug 24 17:27 rcvd_timestamp.dict
-rw-r--r--. 1 pinot pinot    11528 Aug 24 17:27 .sorted.fwd
-rw-r--r--. 1 pinot pinot        0 Aug 24 17:27 .raw.fwd
```

So the questions are:
1. How do I remove phantom files from reporting when checking the table's
health and looking at ingestionStatus?
2. Why would the files for stream5 still be in _tmp and also not be seen when
you do a table health check and look at the ingestionStatus, and how do you
fix this issue?  
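One way to start digging into mismatches like this (controller host and table name below are hypothetical) is to compare the ideal state with the external view for the table; segments present in one but not the other point at where the inconsistency lives:

```bash
# Segments the cluster expects the table to have
curl "http://localhost:9000/tables/stream4_metric/idealstate"
# Segments the servers actually report (ONLINE/CONSUMING/ERROR)
curl "http://localhost:9000/tables/stream4_metric/externalview"
```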
**@camerronadams:** @camerronadams has joined the channel  
 **@gabuglc:** Hello guys, how do I create a multitenant cluster for my
production environment? It gives me the following error:
```
Executing command: AddTenant -controllerProtocol http -controllerHost localhost -controllerPort 9001 -name krealoBrokerTenant -role BROKER -instanceCount 3 -offlineInstanceCount 0 -realTimeInstanceCount 3 -exec
{"_code":500,"_error":"Failed to create tenant"}
{"_code":500,"_error":"Failed to create tenant"}
```  
**@mayanks:** Do you have enough untagged broker participants?  
**@mayanks:** Agree though, that the response should be user friendly  
**@gabuglc:** Just have 1  
**@gabuglc:** same error
```
Executing command: AddTenant -controllerProtocol http -controllerHost 0.0.0.0 -controllerPort 9001 -name krealoBrokerTenant -role BROKER -instanceCount 1 -offlineInstanceCount 0 -realTimeInstanceCount 1 -exec
{"_code":500,"_error":"Failed to create tenant"}
{"_code":500,"_error":"Failed to create tenant"}
root@pinot-controller:/opt/pinot#
```  
**@mayanks:** You have 1 untagged, but you are asking for 3, I think that is
one issue  
**@gabuglc:** yeah, modified for 1 here, still same error
```
Executing command: AddTenant -controllerProtocol http -controllerHost 0.0.0.0 -controllerPort 9001 -name krealoBrokerTenant -role BROKER -instanceCount 1 -offlineInstanceCount 0 -realTimeInstanceCount 1 -exec
{"_code":500,"_error":"Failed to create tenant"}
{"_code":500,"_error":"Failed to create tenant"}
root@pinot-controller:/opt/pinot#
```  
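One way to check how many brokers are actually available to tag (controller host and instance name below are hypothetical) is to list the instances and inspect their tags; only brokers still carrying the default untagged broker tag can be claimed by a new tenant:

```bash
# List every instance registered in the cluster
curl "http://localhost:9000/instances"
# Inspect one broker's details, including its current tags
curl "http://localhost:9000/instances/Broker_pinot-broker-0_8099"
```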

###  _#onboarding_

  
 **@son.nguyen.nam:** @son.nguyen.nam has joined the channel  

###  _#pinot-dev_

  
 **@adireddijagadesh:** @adireddijagadesh has joined the channel  

###  _#getting-started_

  
 **@adireddijagadesh:** @adireddijagadesh has joined the channel  

###  _#releases_

  
 **@son.nguyen.nam:** @son.nguyen.nam has joined the channel  
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pinot.apache.org
For additional commands, e-mail: dev-help@pinot.apache.org