Posted to dev@pinot.apache.org by Pinot Slack Email Digest <ap...@gmail.com> on 2022/04/08 02:00:38 UTC

Apache Pinot Daily Email Digest (2022-04-07)

### _#general_

  
 **@mannamra:** @mannamra has joined the channel  
 **@nrajendra434:** @nrajendra434 has joined the channel  
 **@fding:** @fding has joined the channel  
 **@fizza.abid:** @fizza.abid has joined the channel  
 **@diana.arnos:** Hey there! I have a different type of question this time:
If I had to give a presentation to my company advocating for us to start using
Pinot as a go-to tool for user-facing real-time analytics, which arguments or
points of view would you recommend I speak about?  
**@diogo.baeder:** For BrandIndex, at YouGov, we had a huge issue with
performance when using PostgreSQL, so the biggest factor for us was analytics
performance. But the second factor, I'd say, is the support for multi-valued
columns.  
**@ken:** The big wins for us were (a) using appropriate indices & star trees,
we could satisfy performance requirements for ad hoc queries, (b) SQL
interface made it easy for the UI layer to build dashboards, and (c) we could
bulk build segments (using Flink).  
**@mayanks:** This might also be helpful:  
**@arekchmura:** @arekchmura has joined the channel  
 **@arekchmura:** Hi everyone! I was wondering whether the dataset used  is
available somewhere (Airline data from 1987-2008). I am currently working on
my Master's thesis and I would like to run some experiments on that dataset.
Thanks  
**@mayanks:** Should be part of the integration tests in the Pinot code base.
But there might be better ones out there in the blogs.  
**@mitchellh:** has a link to the source of the dataset.  
**@mitchellh:** also,  might be interesting to you.  
**@arekchmura:** Thank you, that will be very helpful!  
 **@abhinav.wagle1:** Hi there, checking for community experience. We are in
the process of setting up a Kubernetes-based deployment of a Pinot cluster.
Has anyone seen significant performance gains from using SSDs with instance
store instead of EBS for server pods?  
**@mayanks:** Afaik, most folks end up using EBS and it works well. Personally,
I am unaware of a use case that had to move from EBS to SSD for perf.  
**@abhinav.wagle1:** @bagi.priyank: FYI  
**@abhinav.wagle1:** Thanks @mayanks!  
**@bagi.priyank:** Right, I am not saying we must use instance store for our
use case. I am asking to compare SSD on EBS vs. instance store for our query
pattern. We saw considerable performance improvement with our ad hoc queries
on instance store during the PoC.  
**@g.kishore:** Yes, instance-local storage will always be faster than remote
EBS. My suggestion is to have the helm chart support both profiles: start
with EBS, but if you need even better performance, you can choose to shift
dynamically between the two modes.  
**@g.kishore:** E.g., you can have some tables on local storage and other
tables on EBS.  
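
To make the "both profiles" suggestion concrete, here is a minimal sketch
against the Apache Pinot helm chart, assuming it exposes server persistence
settings under `server.persistence.*` (check the values.yaml of your chart
version); the storage class names are illustrative.

```
# Profile 1: EBS-backed servers (the usual starting point):
helm upgrade --install pinot pinot/pinot \
  --set server.persistence.storageClass=gp3

# Profile 2: for latency-sensitive tables, point the same setting at an
# instance-store-backed (local-volume) provisioner instead, e.g.:
#   --set server.persistence.storageClass=local-nvme
# Mixing both in one cluster typically means running a second server pool
# with a different storage class and pinning tables to it via tenant tags.
```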
 **@noiarek:** @noiarek has joined the channel  

###  _#random_

  
 **@mannamra:** @mannamra has joined the channel  
 **@nrajendra434:** @nrajendra434 has joined the channel  
 **@fding:** @fding has joined the channel  
 **@fizza.abid:** @fizza.abid has joined the channel  
 **@arekchmura:** @arekchmura has joined the channel  
 **@noiarek:** @noiarek has joined the channel  

###  _#feat-compound-types_

  
 **@adam.hutson:** @adam.hutson has joined the channel  

###  _#feat-text-search_

  
 **@adam.hutson:** @adam.hutson has joined the channel  

###  _#feat-rt-seg-complete_

  
 **@adam.hutson:** @adam.hutson has joined the channel  

###  _#feat-presto-connector_

  
 **@adam.hutson:** @adam.hutson has joined the channel  

###  _#feat-upsert_

  
 **@adam.hutson:** @adam.hutson has joined the channel  

###  _#pinot-helix_

  
 **@adam.hutson:** @adam.hutson has joined the channel  

###  _#group-by-refactor_

  
 **@adam.hutson:** @adam.hutson has joined the channel  

###  _#qps-metric_

  
 **@adam.hutson:** @adam.hutson has joined the channel  

###  _#order-by_

  
 **@adam.hutson:** @adam.hutson has joined the channel  

###  _#feat-better-schema-evolution_

  
 **@adam.hutson:** @adam.hutson has joined the channel  

###  _#fraud_

  
 **@adam.hutson:** @adam.hutson has joined the channel  

###  _#pinotadls_

  
 **@adam.hutson:** @adam.hutson has joined the channel  

###  _#inconsistent-segment_

  
 **@adam.hutson:** @adam.hutson has joined the channel  

###  _#pinot-power-bi_

  
 **@adam.hutson:** @adam.hutson has joined the channel  

###  _#twitter_

  
 **@adam.hutson:** @adam.hutson has joined the channel  

###  _#apa-16824_

  
 **@adam.hutson:** @adam.hutson has joined the channel  

###  _#pinot-website_

  
 **@adam.hutson:** @adam.hutson has joined the channel  

###  _#minion-star-tree_

  
 **@adam.hutson:** @adam.hutson has joined the channel  

###  _#troubleshooting_

  
 **@mannamra:** @mannamra has joined the channel  
 **@alihaydar.atil:** Hello everyone, is it normal for pinot-server to flood
the log with this? I noticed it after upgrading to 0.10.0: `[Consumer
clientId=consumer-null-808, groupId=null] Seeking to offset 59190962 for
partition mytopic-0`  
**@mayanks:** Seems like this is coming from the Kafka consumer.  
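
If the volume is a problem, a hedged workaround is to raise that logger's
level: the "Seeking to offset" line is INFO-level output from the Kafka
consumer client. A sketch, assuming the stock log4j2 setup (the config path
varies by install):

```
# Quiet the Kafka consumer's INFO-level seek logs by raising its logger to
# WARN; equivalent to adding this inside the <Loggers> section by hand:
#   <Logger name="org.apache.kafka.clients.consumer" level="warn"/>
sed -i 's|</Loggers>|  <Logger name="org.apache.kafka.clients.consumer" level="warn"/>\n</Loggers>|' \
  "$PINOT_HOME/conf/log4j2.xml"
```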
 **@nrajendra434:** @nrajendra434 has joined the channel  
 **@fding:** @fding has joined the channel  
 **@alihaydar.atil:** Hello everyone, if I don't set the
'maxNumRecordsPerSegment' config for my 'RealtimeToOfflineSegmentsTask',
would it truncate my data if I have more records than the default value (the
docs say 5,000,000) for that time window?  
**@npawar:** If you have more than 5M records (or whatever value is set in
maxNumRecordsPerSegment), it will generate multiple segments in that run,
with 5M records per segment. No truncation.  
**@alihaydar.atil:** thank you for the response :pray:  
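
For reference, a minimal sketch of where this knob lives in the realtime table
config, following the documented task-config shape (the period values are
illustrative):

```
"task": {
  "taskTypeConfigsMap": {
    "RealtimeToOfflineSegmentsTask": {
      "bucketTimePeriod": "1d",
      "bufferTimePeriod": "2d",
      "maxNumRecordsPerSegment": "5000000"
    }
  }
}
```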
 **@fizza.abid:** @fizza.abid has joined the channel  
 **@fizza.abid:** Hello everyone! I want to connect my S3 data to Apache
Pinot. Can someone guide me on it? Is it possible through Helm, or will I have
to create an ingestion job? Currently, we don't use Kafka.  
**@mark.needham:** There's a guide that shows how to import S3 files here --  
**@fizza.abid:** And can you tell me where we need to run this command? I have
configured it using Helm and deployed on Kubernetes.  
**@tisantos:** @fizza.abid you just need to create a Pinot table with a
table config containing the S3 ingestion properties. You can schedule the
ingestion via the `schedule` property, or you can trigger it manually via the
controller REST API.  
**@tisantos:** Check the /task/schedule API in swagger  
**@npawar:** @tisantos I believe Mark's steps point to the LaunchIngestionJob
command and not the minion-based ingestion.  
**@tisantos:** Ah, I believe you're correct. In that case you should be able
to SSH into your controller and execute the `pinot-admin.sh` script in the
/bin directory.  
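
To make the standalone route concrete, a minimal sketch of a job spec plus the
launch command, assuming the S3 plugin is on the classpath; the bucket,
region, input format, and table name are all illustrative:

```
# Write an ingestion job spec (values illustrative; see the S3 batch import
# guide for the full option list):
cat > /tmp/ingestion-job-spec.yaml <<'EOF'
executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
jobType: SegmentCreationAndTarPush
inputDirURI: 's3://my-bucket/input/'
outputDirURI: 's3://my-bucket/segments/'
overwriteOutput: true
pinotFSSpecs:
  - scheme: s3
    className: org.apache.pinot.plugin.filesystem.S3PinotFS
    configs:
      region: 'us-east-1'
recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
tableSpec:
  tableName: 'myTable'
pinotClusterSpecs:
  - controllerURI: 'http://localhost:9000'
EOF

# Run from the Pinot distribution directory (e.g. on the controller host):
bin/pinot-admin.sh LaunchDataIngestionJob -jobSpecFile /tmp/ingestion-job-spec.yaml
```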
 **@arekchmura:** @arekchmura has joined the channel  
 **@luisfernandez:** in the Pinot docs we have this about  but how is this
starting the Pinot infra via the IDE? I guess that ultimately my question is:
how can I attach a remote debugger to my local Pinot processes?  
**@mayanks:** It is suggesting to start the `quickStart` program, which
internally starts all Pinot components within the same JVM. You can run Pinot
and debug it in the IDE as you would any application.  
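
If the goal is instead to attach a remote debugger to locally launched Pinot
processes, a minimal sketch using standard JDWP flags (assuming the launcher
scripts honor the `JAVA_OPTS` environment variable; port 5005 is an arbitrary
choice):

```
# Standard JDWP agent flags; suspend=n lets Pinot start without waiting for
# the debugger to attach (the *: host prefix needs JDK 9+).
export JAVA_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=*:5005"
sh pinot-admin.sh QuickStart -type EMPTY
# Then create a "Remote JVM Debug" run configuration in the IDE against
# localhost:5005.
```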
**@luisfernandez:** I got this exception, do you know why it may be?  
**@luisfernandez:**
```
Instance 0.0.26.108_9000 is not leader of cluster QuickStartCluster due to
exception happen when session check
org.I0Itec.zkclient.exception.ZkInterruptedException: java.lang.InterruptedException
```
**@luisfernandez:** I was trying to run the empty quickstart mode.  
**@mayanks:** This is a newer feature. @kennybastani any idea on what might be
going on?  
**@kennybastani:** @luisfernandez What command are you using to start Pinot?  
**@luisfernandez:** like this: `sh pinot-admin.sh QuickStart -type EMPTY
-dataDir "usr/local/var/lib/pinot/data"`  
**@kennybastani:** Do you have ZK running externally?  
**@luisfernandez:** no  
**@luisfernandez:** but I also don't have ZooKeeper running locally. Do I have
to run ZK manually first? I thought this would start ZK for me.  
**@kennybastani:** Yes, it will  
**@kennybastani:** One sec  
**@kennybastani:** Please run this command  
**@kennybastani:** `netstat -vanp tcp | grep '*.2123\|9000\|8000\|7000'`  
**@kennybastani:** And let me know what the output is  
**@kennybastani:** Also, `ls /usr/local/var/lib/pinot/data/rawdata`  
**@kennybastani:** @luisfernandez Let me know if you got it solved. Happy to
jump on a call if you need help with anything.  
 **@diogo.baeder:** Hi folks! This could probably be a question more geared
towards @ken, but I'll ask broadly anyway: is there any documentation
available about how to implement ad-hoc segment replacement, in terms of what
this flow would be? I'll follow up in this thread.  
**@diogo.baeder:** What I want to have is a single table that holds data for
multiple regions and sectors within these regions. And I also want to be able
to partition the data by region and sector. The problem is that with the daily
ingestion I would do, I would end up with far too many segments, and they
would be too small, most of them under 1 MB of data. So I thought about using
merge rollups, which some here recommended to me; however, that would
probably just merge everything together for each bucket, thus defeating my
partitioning per region and sector. Then I thought I could just implement the
rolling up of these segments myself. The problem, though, is that I have no
idea how this works. How do I "build a segment"? Do I just create a batch job
for each rolled-up segment, and then delete the old tiny ones? What's the
recommended way to approach this?  
**@mayanks:**  
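
For the "build it yourself and swap" flow, a hedged sketch against the
controller REST API (the endpoints are visible in the controller's Swagger UI;
table and segment names are illustrative): build one merged segment per
region/sector/time bucket with a batch ingestion job, push it, then drop the
small segments it covers.

```
# 1. Push the newly built segment tar to the controller:
curl -F segment=@/tmp/merged_region1_sector7_2022-04.tar.gz \
  "http://localhost:9000/v2/segments?tableName=myTable_OFFLINE"

# 2. Delete each tiny segment the merged one replaces:
curl -X DELETE \
  "http://localhost:9000/segments/myTable_OFFLINE/myTable_2022-04-01_region1_sector7"
```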
 **@noiarek:** @noiarek has joined the channel  
 **@ysuo:** Hi team, my table segments show a bad status. Queries on this
table return a 305 error and segments are not available. I reset all segments
and it didn't work. What should I do in this case? Thanks.  

###  _#pinot-s3_

  
 **@adam.hutson:** @adam.hutson has joined the channel  

###  _#pinot-k8s-operator_

  
 **@adam.hutson:** @adam.hutson has joined the channel  

###  _#onboarding_

  
 **@adam.hutson:** @adam.hutson has joined the channel  

###  _#feat-geo-spatial-index_

  
 **@adam.hutson:** @adam.hutson has joined the channel  

###  _#transform-functions_

  
 **@adam.hutson:** @adam.hutson has joined the channel  

###  _#custom-aggregators_

  
 **@adam.hutson:** @adam.hutson has joined the channel  

###  _#inconsistent-perf_

  
 **@adam.hutson:** @adam.hutson has joined the channel  

###  _#docs_

  
 **@adam.hutson:** @adam.hutson has joined the channel  

###  _#aggregators_

  
 **@adam.hutson:** @adam.hutson has joined the channel  

###  _#tmp_

  
 **@adam.hutson:** @adam.hutson has joined the channel  

###  _#query-latency_

  
 **@adam.hutson:** @adam.hutson has joined the channel  

###  _#dhill-date-seg_

  
 **@adam.hutson:** @adam.hutson has joined the channel  

###  _#enable-generic-offsets_

  
 **@adam.hutson:** @adam.hutson has joined the channel  

###  _#pinot-dev_

  
 **@adam.hutson:** @adam.hutson has joined the channel  

###  _#community_

  
 **@adam.hutson:** @adam.hutson has joined the channel  

###  _#feat-pravega-connector_

  
 **@adam.hutson:** @adam.hutson has joined the channel  

###  _#announcements_

  
 **@adam.hutson:** @adam.hutson has joined the channel  

###  _#s3-multiple-buckets_

  
 **@adam.hutson:** @adam.hutson has joined the channel  

###  _#release-certifier_

  
 **@adam.hutson:** @adam.hutson has joined the channel  

###  _#multiple_streams_

  
 **@adam.hutson:** @adam.hutson has joined the channel  

###  _#lp-pinot-poc_

  
 **@adam.hutson:** @adam.hutson has joined the channel  

###  _#roadmap_

  
 **@adam.hutson:** @adam.hutson has joined the channel  

###  _#presto-pinot-connector_

  
 **@adam.hutson:** @adam.hutson has joined the channel  

###  _#multi-region-setup_

  
 **@adam.hutson:** @adam.hutson has joined the channel  

###  _#metadata-push-api_

  
 **@adam.hutson:** @adam.hutson has joined the channel  

###  _#pql-sql-regression_

  
 **@adam.hutson:** @adam.hutson has joined the channel  

###  _#latency-during-segment-commit_

  
 **@adam.hutson:** @adam.hutson has joined the channel  

###  _#pinot-realtime-table-rebalance_

  
 **@adam.hutson:** @adam.hutson has joined the channel  

###  _#release060_

  
 **@adam.hutson:** @adam.hutson has joined the channel  

###  _#time-based-segment-pruner_

  
 **@adam.hutson:** @adam.hutson has joined the channel  

###  _#discuss-validation_

  
 **@adam.hutson:** @adam.hutson has joined the channel  

###  _#segment-cold-storage_

  
 **@adam.hutson:** @adam.hutson has joined the channel  

###  _#new-office-space_

  
 **@adam.hutson:** @adam.hutson has joined the channel  

###  _#config-tuner_

  
 **@adam.hutson:** @adam.hutson has joined the channel  

###  _#test-channel_

  
 **@adam.hutson:** @adam.hutson has joined the channel  

###  _#pinot-perf-tuning_

  
 **@adam.hutson:** @adam.hutson has joined the channel  

###  _#thirdeye-pinot_

  
 **@adam.hutson:** @adam.hutson has joined the channel  

###  _#getting-started_

  
 **@mannamra:** @mannamra has joined the channel  
 **@nrajendra434:** @nrajendra434 has joined the channel  
 **@fding:** @fding has joined the channel  
 **@fizza.abid:** @fizza.abid has joined the channel  
 **@arekchmura:** @arekchmura has joined the channel  
 **@luisfernandez:** I'm trying to import at least 2 years' worth of data and
was looking to see if I could get some guidance on how to go about this. I
have been taking a look at the ingestion job framework; is this the way to go
about this? What are some of the considerations we have to make when doing
these backfills? I see that the data is divided into folders by day, and each
of these days will be a segment in Pinot, is that right? How do we ensure that
the data we are ingesting will still perform well? And what are some tips you
could give when moving a lot of data?  
**@xiangfu0:** The general guideline is to pre-partition the data by date;
then you will have multiple raw data files per day, and each data file will
become one Pinot segment (a 1:1 mapping).  
**@xiangfu0:** For ingestion, segment creation and push are an external
process, or you can start a set of Pinot minion nodes to do the job.  
**@xiangfu0:** That will not impact your runtime Pinot servers.  
**@xiangfu0:** For the data push, set the push parallelism to ensure you won't
exhaust the Pinot controller.  
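
The parallelism knob lives in the push section of the ingestion job spec; a
minimal fragment (field names per the batch ingestion spec docs, values
illustrative):

```
pushJobSpec:
  pushParallelism: 2
  pushAttempts: 2
```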
**@luisfernandez:** Right, as explained here,  and in short, as you said, each
of those files will be a segment. How do I know my segment size is okay?  
**@luisfernandez:** for each of the files  
**@luisfernandez:** right now we have a hybrid model, and these are our
configs for the current segments on the realtime side of it:  
**@luisfernandez:**
```
"realtime.segment.flush.threshold.rows": "0",
"realtime.segment.flush.threshold.time": "24h",
"realtime.segment.flush.segment.size": "250M"
```
**@luisfernandez:** another question that I had is how these configs impact
the offline table:
```
"ingestionConfig": {
  "batchIngestionConfig": {
    "segmentIngestionType": "APPEND",
    "segmentIngestionFrequency": "HOURLY"
  }
}
```
**@mayanks:**
```
segmentIngestionType      - Used for data retention
segmentIngestionFrequency - Used to compute the time boundary for hybrid tables
```
**@luisfernandez:** thank you mayank  
**@luisfernandez:** also to explain our current setup, we have this:
```
realtime table with 7 days retention, offline table with 2 years retention
(realtime data is eventually moved here)
```
We want to backfill the offline table with data that is on the system we are
moving away from. Is this the way people usually do it, or do we usually
create another offline table that does backfilling only?  
**@mayanks:** You can backfill a hybrid table.  
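
A hedged sketch of that in practice: run the batch job against the OFFLINE
half of the hybrid table for the historical range; a pushed segment that
reuses an existing segment's name replaces it, which is what makes a backfill
safe to re-run. This assumes a job spec that templates the input directory as
`${inputDirURI}` and a Pinot version whose `LaunchDataIngestionJob` accepts
`-values` overrides:

```
# Backfill one day-folder at a time (paths illustrative):
for day in /data/raw/2020-*/; do
  bin/pinot-admin.sh LaunchDataIngestionJob \
    -jobSpecFile /tmp/backfill-job-spec.yaml \
    -values inputDirURI="$day"
done
```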
 **@noiarek:** @noiarek has joined the channel  

###  _#feat-partial-upsert_

  
 **@adam.hutson:** @adam.hutson has joined the channel  

###  _#pinot_website_improvement_suggestions_

  
 **@adam.hutson:** @adam.hutson has joined the channel  

###  _#segment-write-api_

  
 **@adam.hutson:** @adam.hutson has joined the channel  

###  _#releases_

  
 **@adam.hutson:** @adam.hutson has joined the channel  

###  _#metrics-plugin-impl_

  
 **@adam.hutson:** @adam.hutson has joined the channel  

###  _#debug_upsert_

  
 **@adam.hutson:** @adam.hutson has joined the channel  

###  _#flink-pinot-connector_

  
 **@adam.hutson:** @adam.hutson has joined the channel  

###  _#pinot-rack-awareness_

  
 **@adam.hutson:** @adam.hutson has joined the channel  

###  _#minion-improvements_

  
 **@adam.hutson:** @adam.hutson has joined the channel  

###  _#fix-numerical-predicate_

  
 **@adam.hutson:** @adam.hutson has joined the channel  

###  _#complex-type-support_

  
 **@adam.hutson:** @adam.hutson has joined the channel  

###  _#fix_llc_segment_upload_

  
 **@adam.hutson:** @adam.hutson has joined the channel  

###  _#product-launch_

  
 **@adam.hutson:** @adam.hutson has joined the channel  

###  _#pinot-docsrus_

  
 **@adam.hutson:** @adam.hutson has joined the channel  

###  _#pinot-trino_

  
 **@adam.hutson:** @adam.hutson has joined the channel  

###  _#kinesis_help_

  
 **@adam.hutson:** @adam.hutson has joined the channel  

###  _#udf-type-matching_

  
 **@adam.hutson:** @adam.hutson has joined the channel  

###  _#jobs_

  
 **@adam.hutson:** @adam.hutson has joined the channel  
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pinot.apache.org
For additional commands, e-mail: dev-help@pinot.apache.org