Posted to dev@pinot.apache.org by Pinot Slack Email Digest <ap...@gmail.com> on 2021/11/17 02:00:18 UTC

Apache Pinot Daily Email Digest (2021-11-16)

### _#general_

  
**@mrpringle:** I see version 0.9 is in RC; do we have a binary download link
for this version? Some nice new features to try out.  
**@mayanks:** It is not officially released yet, but should be shortly  
**@mayanks:** Just curious, what features were you interested in trying out
@mrpringle?  
**@mrpringle:** Looking at the last-with-timestamp aggregation function; we
need this to do sums across pre-aggregated totals  
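
For context, a rough sketch of how that kind of query might look once 0.9 is out, assuming this refers to the LASTWITHTIME(dataColumn, timeColumn, 'dataType') aggregation added in 0.9, and using a hypothetical pre-aggregated table and columns:

```
-- pick the most recent pre-aggregated total per entity;
-- daily_totals, entityId, runningTotal and updatedAtMillis are hypothetical names
SELECT entityId,
       LASTWITHTIME(runningTotal, updatedAtMillis, 'LONG') AS latestTotal
FROM daily_totals
GROUP BY entityId
```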
 **@ansi395958:** @ansi395958 has joined the channel  
 **@karinwolok1:** :wave: Hello newbie Pinot community members! :wine_glass:
:partying_face: We're happy to have you here! Curious about what you're working
on and how you found Apache Pinot! Please introduce yourselves here in this
thread! :smiley: @ansi395958 @shantanoo.sinha @julien.picard @aarti.gaddale187
@bowenzhu @brandon @gabriel.nau @waqasdilawardaha @maitreyi.kv @nicholas.nezis
@dino.occhialini @scott.cohen @aaron.weiss @laabidi.raissi @nesrullayev.ali
@akshay13jain @zaid.mohemmad @alisonjanedavey @dtong @raluca.lazar @andre578
@ayush.network @xinxinzhenbang @sumit.l @nsanthanam @cgregor @diogodssantos
@mingfeng.tan @navi.trinity @stuartcoleman81 @stuart.coleman @ryan
@shreya.chakraborty @joseph.roldan @folutade @jurio0 @priyam @randxiexyy29
@stavg @rohitdev.kulshrestha @hamsemxiao @vivek.bi @yeongjukang @mail9deep  
 **@ashok.rex.2009:** @ashok.rex.2009 has joined the channel  
 **@troy:** @troy has joined the channel  
 **@sam:** @sam has joined the channel  
**@cgregor:** Thanks @karinwolok1! Hi everyone :wave: I'm currently working
on a set of automatic code transformations to help when migrating from Joda-
Time to java.time. I noticed a discussion about migrating from Joda to
java.time, so I'm interested in whether I can be of any help during this
process. I am currently just trying to get more familiar with Pinot and its
components, as I haven't used it before. I will demo Pinot to our engineering
team once I have a better grasp of it. If anyone is interested in discussing
#7499 then I'm keen to understand if I can be of any use!  

###  _#random_

  
 **@ansi395958:** @ansi395958 has joined the channel  
 **@ashok.rex.2009:** @ashok.rex.2009 has joined the channel  
 **@troy:** @troy has joined the channel  
 **@sam:** @sam has joined the channel  

###  _#feat-presto-connector_

  
 **@scott.cohen:** @scott.cohen has joined the channel  

###  _#troubleshooting_

  
 **@tony:** Backfill question -- we have a large REALTIME table (~900GB/day).
Due to a configuration error (ZK heap size too low) we lost some data because
the Kafka retention was shorter than the time it took to fix the bug. This has
me thinking of ways to fill in missing data in the future for disaster
recovery. We have all the raw data sitting in Parquet files in our data lake.
My initial thought was to regenerate the segments with missing data (they are
easy to identify). Is it possible to upload (refresh) REALTIME segments,
assuming the event time range is correct (there would be more events in the
replacement segment)? Or do I have to use a HYBRID table and either populate
the OFFLINE segments myself or use ?  
**@mayanks:** Right now, pushing data to a realtime table is disabled; it needs
the managed offline flow. But AFAIK the Uber team is working on backfill
support for RT tables. Is this still the case @yupeng?  
**@yupeng:** Right, we are working on such a backfill pipeline in Flink  
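
For context, the "managed offline flow" mentioned above usually means letting the minion-driven RealtimeToOfflineSegmentsTask move data from the realtime table into the OFFLINE table of a hybrid setup, and pushing backfilled or corrected segments to the OFFLINE side. A minimal sketch of how the task is enabled in the realtime table config (the time periods here are illustrative, not a recommendation):

```
"task": {
  "taskTypeConfigsMap": {
    "RealtimeToOfflineSegmentsTask": {
      "bucketTimePeriod": "1d",
      "bufferTimePeriod": "2d"
    }
  }
}
```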
 **@nair.a:** Hi team, this is regarding batch ingestion from HDFS to an
OFFLINE table. After running the following command: *bin/pinot-ingestion-
job.sh -jobSpecFile /root/hdfsBatchIngestionSpec1.yaml* I get the following
logs, but segments are not getting created.

```
Trying to create instance for class org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner
Initializing PinotFS for scheme hdfs, classname org.apache.pinot.plugin.filesystem.HadoopPinotFS
Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
log4j:WARN No appenders could be found for logger (org.apache.htrace.core.Tracer).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See  for more info.
No unit for dfs.client.datanode-restart.timeout(30) assuming SECONDS
No unit for dfs.client.datanode-restart.timeout(30) assuming SECONDS
The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
successfully initialized HadoopPinotFS
Creating an executor service with 1 threads(Job parallelism: 0, available cores: 24.)
Submitting one Segment Generation Task for 
Using class: org.apache.pinot.plugin.inputformat.parquet.ParquetRecordReader to read segment, ignoring configured file format: AVRO
Trying to create instance for class org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner
Initializing PinotFS for scheme hdfs, classname org.apache.pinot.plugin.filesystem.HadoopPinotFS
successfully initialized HadoopPinotFS
Start pushing segments: []... to locations: [org.apache.pinot.spi.ingestion.batch.spec.PinotClusterSpec@5d28bcd5] for table poc_test_table
```  
**@dunithd:** Is it possible to share the *hdfsBatchIngestionSpec1.yaml* with
us?  
**@nair.a:** BatchIngestionSpec file:

```
executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner'
  segmentMetadataPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentMetadataPushJobRunner'
jobType: SegmentCreationAndTarPush
inputDirURI: ''
outputDirURI: ''
overwriteOutput: true
pinotFSSpecs:
  - scheme: hdfs
    className: org.apache.pinot.plugin.filesystem.HadoopPinotFS
    configs:
      hadoop.conf.path: '/root/hadoop-3.0.0/etc/hadoop/'
recordReaderSpec:
  dataFormat: 'parquet'
  className: 'org.apache.pinot.plugin.inputformat.parquet.ParquetRecordReader'
tableSpec:
  tableName: 'poc_test_table'
  schemaURI: ''
  tableConfigURI: ''
pinotClusterSpecs:
  - controllerURI: ''
pushJobSpec:
  pushParallelism: 2
  pushAttempts: 1
```  
**@adireddijagadesh:** @nair.a The given `hadoop.conf.path`
'/root/hadoop-3.0.0/etc/hadoop/' should contain the Hadoop XML configuration
files such as hdfs-site.xml and core-site.xml. Can you recheck that the
provided path contains the config files, or whether
`/root/hadoop-3.0.0/etc/hadoop/conf/` is the correct path?  
**@nair.a:** Yes, it's present. From the logs, it seems the script is able to
connect to the Hadoop cluster, since it has listed the file.  
**@ken:** I think you might be missing some important logging output, given
the log4j warnings. Also, what version of Pinot are you running? Finally, what
happens if you try running a job for just segment generation, and do it
locally (download the Parquet file, and use local FS for input/output)?  
**@nair.a:** Will check the logging conf. We are running Pinot 0.8. Haven't
done ingestion with local FS; will try  
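
For reference, a minimal local-FS, generation-only job spec along the lines @ken suggests might look roughly like this (a sketch only; the local paths and controller host below are hypothetical):

```
# Standalone segment generation from a locally downloaded Parquet file,
# writing segments to local disk; no push step involved.
executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
jobType: SegmentCreation
inputDirURI: 'file:///tmp/parquet-input/'
outputDirURI: 'file:///tmp/pinot-segments/'
overwriteOutput: true
pinotFSSpecs:
  - scheme: file
    className: org.apache.pinot.spi.filesystem.LocalPinotFS
recordReaderSpec:
  dataFormat: 'parquet'
  className: 'org.apache.pinot.plugin.inputformat.parquet.ParquetRecordReader'
tableSpec:
  tableName: 'poc_test_table'
  schemaURI: 'http://<controller-host>:9000/tables/poc_test_table/schema'
  tableConfigURI: 'http://<controller-host>:9000/tables/poc_test_table'
```

Running this in isolation should show whether the problem is in segment generation itself or in the HDFS/push side.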
**@adireddijagadesh:** Are there any logs related to the starting/ending of the
Segment Index Creator?  
 **@ansi395958:** @ansi395958 has joined the channel  
 **@lars-kristian_svenoy:** Hey everyone. Quick question: when querying for a
specific time range in Pinot, is it more efficient to use the primary time
column defined in the segmentsConfig, or is it equivalent to using any other
time column? The docs seem to indicate that the primary time column is only
used for retention purposes, meaning that querying on another timestamp should
be fine too. In my case, I am creating a copy of the primary timestamp,
reducing its granularity, and calling it `daysSinceEpoch`, as I want to query
for entities within certain days.

```
"ingestionConfig": {
  "transformConfigs": [
    {
      "columnName": "daysSinceEpoch",
      "transformFunction": "toEpochDays(documentTimestamp)"
    }
  ],
  ...
```

Additionally, for the RealtimeToOfflineSegmentsTask, I am using this value for
deduplication purposes. In the schema:

```
"primaryKeyColumns": ["customerId", "machineId", "daysSinceEpoch"]
...
```

This is because for each event, I only want to keep the latest in a day.
Here’s the RealtimeToOfflineSegmentsTask:

```
"RealtimeToOfflineSegmentsTask": {
  "bucketTimePeriod": "1d",
  "bufferTimePeriod": "2d",
  "mergeType": "dedup",
  "maxNumRecordsPerSegment": 10000000,
  "roundBucketTimePeriod": "1h"
}
```

In the realtime table, I am also filtering out any events older than 14 days
(where documentTimestamp is the actual primary timeColumnName):

```
"filterConfig": {
  "filterFunction": "Groovy({documentTimestamp < (new Date() - 14).getTime()}, documentTimestamp)"
}
```

Does that make sense?  
**@npawar:** you can use any time column. you’re right that primary time
column is mainly used for things like retention  
**@lars-kristian_svenoy:** That’s great, thank you @npawar
:slightly_smiling_face: I had assumed as much  
**@npawar:** you cannot really define your own primary keys for the
realtimeToOfflineSegments task's dedup mode. It will dedup only if the entire
row is the same  
**@lars-kristian_svenoy:** Oh, it doesn’t use the primary key defined in the
schema?  
**@npawar:** the primaryKeyColumns field you see is for the upsert feature. It
doesn't take any effect for realtimeToOffline  
**@lars-kristian_svenoy:** aahh  
**@lars-kristian_svenoy:** Is there any reason why?  
**@npawar:** dedup is a relatively new feature in the realtimeToOffline task.
This version only does full-row dedup. We'd need to add a lot more config and
code to support the next level of smarter dedup  
**@npawar:** regarding filtering out events older than 14d, could you just set
the table retention to 14d? Any reason you're using the filter function instead?  
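
For illustration, retention lives in the table's segmentsConfig; a minimal sketch of the setting @npawar mentions, reusing the documentTimestamp column from above (the values are just an example):

```
"segmentsConfig": {
  "timeColumnName": "documentTimestamp",
  "retentionTimeUnit": "DAYS",
  "retentionTimeValue": "14",
  ...
}
```

Note that retention removes whole segments once their time range ages out, whereas the filterConfig drops individual late-arriving records at ingestion time, which is why the two are not interchangeable here.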
**@lars-kristian_svenoy:** I sometimes get old events coming in through kafka
which I don’t want to include in my segments  
**@mayanks:** @lars-kristian_svenoy You can filter those rows at ingestion
time:  
**@ashok.rex.2009:** @ashok.rex.2009 has joined the channel  
 **@troy:** @troy has joined the channel  
 **@mercyshans:** Hi team, any insight on this SQL issue? I am trying to use
the `distinctCount` aggregation function to count under different conditions:

```
select distinctCount(case when condition1 then colA else null end) as condition1Count,
       distinctCount(case when condition2 then colA else null end) as condition2Count,
       distinctCount(case when condition3 then colA else null end) as condition3Count
from tableA
```

colA is of type int or String, but it looks like this is not supported in
Pinot because null is not supported in the selection query. Will there be
future support for this?  
**@xiangfu0:** It requires the same type for the function to be applied. You
can always cast them to string  
**@mercyshans:** do you mean changing the `null` to `'null'`? I tried that,
but then `'null'` is counted as one distinct value  
**@xiangfu0:** yes, right now the workaround is to handle null as one
distinct value. Real null support in aggregation functions will be added
later  
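
To make the workaround concrete, something along these lines might work when colA is a string column; for an int colA, both branches would first need to be cast to string, per @xiangfu0's suggestion. This is only a sketch, and the sentinel 'null' will be counted as one extra distinct value, as noted above:

```
select distinctCount(case when condition1 then colA else 'null' end) as condition1Count,
       distinctCount(case when condition2 then colA else 'null' end) as condition2Count,
       distinctCount(case when condition3 then colA else 'null' end) as condition3Count
from tableA
```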
**@mercyshans:** ok, thanks  
 **@sam:** @sam has joined the channel  

###  _#custom-aggregators_

  
 **@kis:** @kis has joined the channel  

###  _#pinot-dev_

  
 **@ashok.rex.2009:** @ashok.rex.2009 has joined the channel  

###  _#pql-sql-regression_

  
 **@kis:** @kis has joined the channel  

###  _#thirdeye-pinot_

  
 **@pyne.suvodeep:** Hi @shreya.chakraborty Please create a github issue in
itself.  

###  _#getting-started_

  
 **@kangren.chia:** will using `IdSet` with “NOT IN” clause have any
unintended performance impact? e.g. `select * from table where userid not in
IDSET(...)`  
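
For reference, the IdSet flow as I understand it is a two-step pattern: build a serialized set with the `ID_SET` aggregation, then filter with `IN_ID_SET`, where comparing to 0 gives the NOT IN semantics. Whether the negated form behaves differently performance-wise is exactly the open question above. A sketch, reusing the column name from the question with a placeholder for the serialized set:

```
-- step 1: build a serialized IdSet from some driving query
SELECT ID_SET(userid) FROM other_table WHERE ...

-- step 2: filter with it; '= 0' gives the NOT IN semantics
SELECT * FROM table WHERE IN_ID_SET(userid, '<serialized-idset-from-step-1>') = 0
```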

###  _#releases_

  
 **@sam:** @sam has joined the channel  

###  _#debug_upsert_

  
 **@kkmagic99:** @kkmagic99 has joined the channel  

###  _#pinot-docsrus_

  
 **@bagi.priyank:** @bagi.priyank has joined the channel  
 **@bagi.priyank:** @bagi.priyank has left the channel  