Posted to dev@pinot.apache.org by Pinot Slack Email Digest <sn...@apache.org> on 2021/06/02 02:00:24 UTC

Apache Pinot Daily Email Digest (2021-06-01)

### _#general_

  
 **@hongtaozhang:** @hongtaozhang has joined the channel  
 **@krishnalumia535:** @krishnalumia535 has joined the channel  
 **@kaustabhganguly:** @kaustabhganguly has joined the channel  
 **@kaustabhganguly:** Hii everyone ! I'm Kaustabh from India  
**@kaustabhganguly:** I'm a fresh CS grad, just exploring things. I am new
to streaming data, Kafka and Pinot. I want to merge batched data and streaming
data and use Pinot on top of it. My solution is to use Kafka Connect, as it's
an ideal tool for merging batched and streaming data into topics &
partitions. So my pipeline is basically using Kafka for merging and then using
Pinot for streaming from Kafka. *Is there a better solution that comes to
anyone's mind? Please correct me if there's any fallacy in my logic.*  
**@mayanks:** Since Pinot can ingest offline data directly, you could
simply have Pinot ingest from a separate offline pipeline as well as from the
Kafka stream.  
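(For reference, batch ingestion into an offline table is typically driven by a job spec YAML passed to the `LaunchDataIngestionJob` command; the input/output paths, file pattern, table name, and controller URI below are placeholders, so treat this as a sketch rather than a ready-to-run spec.)

```yaml
# Illustrative Pinot batch ingestion job spec; all URIs and names here
# are placeholders for your environment.
executionFrameworkSpec:
  name: 'standalone'
jobType: SegmentCreationAndTarPush
inputDirURI: '/path/to/batch/input'
includeFileNamePattern: 'glob:**/*.csv'
outputDirURI: '/path/to/segment/output'
tableSpec:
  tableName: 'myTable'
pinotClusterSpecs:
  - controllerURI: 'http://localhost:9000'
```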
**@kaustabhganguly:** Thanks. Trying that out. Will ask here if I have some
doubts.  
**@mayanks:** Yes, feel free to ask any questions here  
**@mayanks:**  
**@mayanks:** Also see if you can find your answer ^^. If not, perhaps we can
improve the docs  
**@kaustabhganguly:** Sure thing.  
 **@pedro.cls93:** Hello, how does Pinot decide whether a field in an incoming
message is null, in order to apply the defaultNullValue? Does the key of the field have
to be missing? For a String field named `fieldX` with default value
`"default"`:
```json
{
  "schemaName": "HitExecutionView",
  "dimensionFieldSpecs": [
    { "name": "fieldX", "dataType": "STRING", "defaultNullValue": "default" },
    ...
  ]
}
```
if an incoming message has the following payload:
```json
{
  ...,
  "fieldX": null,
  ...,
}
```
What is the expected value in Pinot? `null` or `"default"`?  
 **@mayanks:** If the incoming value is null, it gets translated into the default null value
and stored in Pinot. So in your example, “default” will be stored  
**@pedro.cls93:** I'm seeing differently, would you mind joining a call with
me and taking a look?  
 **@anusha.munukuntla:** @anusha.munukuntla has joined the channel  
 **@kylebanker:** @kylebanker has joined the channel  
 **@ken:** My ops guy is setting up Docker containers, and wants to know why
the base Pinot Dockerfile has ```VOLUME ["${PINOT_HOME}/configs",
"${PINOT_HOME}/data"]``` since he sees that there’s nothing being stored in
the `/data` directory. Any input?  
**@mayanks:** Servers will store a local copy of segments there?  
**@ken:** But normally local copies of segments are stored in `/tmp/xxx`, or
so I thought?  
**@dlavoie:** By default, the OSS helm chart will configure $HOME/data as the
data dir for pinot  
**@dlavoie:** It’s in line with the default value of `controller.data.dir` of
the helm chart.  
**@ken:** Hmm, OK. So since we’re using HDFS as the deep store, this wouldn’t
be getting used, right?  
**@dlavoie:** Indeed  
**@dlavoie:** But keep in mind that servers will use that path  
**@dlavoie:** So the volume defined in the docker image is relevant for the
segments stored by the servers.  
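(As an illustration, persisting that path with a named volume might look like the following compose fragment; the volume name and image tag are placeholders, not the official compose file.)

```yaml
# Illustrative docker-compose excerpt: keep the server's segment data
# on a named volume so it survives container restarts.
services:
  pinot-server:
    image: apachepinot/pinot:latest
    volumes:
      - pinot-server-data:/var/pinot/server/data
volumes:
  pinot-server-data:
```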
**@ken:** But wouldn’t you want that to be temp storage, and not mapped
outside of Docker?  
**@dlavoie:** Nope  
**@dlavoie:** It’s the same as kafka  
**@dlavoie:** sure  
**@dlavoie:** brokers can rebuild their data from other replicas and deepstore
and everything  
**@dlavoie:** But, trust me, if you want to avoid network jitter when your
servers are restarting, you’ll be happy with a persistent volume for your
segments on the servers  
**@dlavoie:** Segment FS hosted by server should not be considered temporary  
**@dlavoie:** Deepstore download is a fallback in case of loss  
**@ken:** I’ll have to poke around in one of our server processes to see why
the ops guy thinks there’s nothing in /data  
**@ken:** Thanks for the input  
**@dlavoie:** Check how your server data dir is configured  
**@dlavoie:** If you want to speed up server restart and avoid redownloading
segments from deepstore, configuring the data dir of server in a persistent
volume will improve stability of your cluster greatly when things go wrong  
**@ken:** Right. So this would be a `server.data.dir` configuration value?  
**@dlavoie:** `pinot.server.instance.dataDir` :upside_down_face:  
**@dlavoie:** the takeaway is that the volume defined in the dockerfile is
opinionated toward the OSS helm chart and not aligned with the default values from
the… dockerfile itself…  
**@ken:** Nice. I guess `pinot.server.instance.segmentTarDir` can be a temp
dir then.  
**@dlavoie:** not exactly  
**@dlavoie:** turns out it’s more subtle than that :slightly_smiling_face:  
**@dlavoie:**
```
dataDir: /var/pinot/server/data/index
segmentTarDir: /var/pinot/server/data/segment
```
**@dlavoie:** `pinot.server.instance.dataDir` is the index storage location,
and `pinot.server.instance.segmentTarDir` is the tgz dir  
**@dlavoie:** helm chart stores them both in the same `data` volume of the
dockerfile  
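(In server-config terms, the two settings discussed above might be set like this; the paths mirror the helm chart defaults quoted earlier and are illustrative, not mandatory.)

```properties
# Illustrative pinot-server.conf excerpt, mirroring the helm chart defaults:
# index storage (the data that is on the query path)
pinot.server.instance.dataDir=/var/pinot/server/data/index
# downloaded segment tarballs (useful for rebuilding indexes)
pinot.server.instance.segmentTarDir=/var/pinot/server/data/segment
```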
**@ken:** OK - seems like the docs could use some editing love. They currently say for
`pinot.server.instance.dataDir` “Directory to hold all the data”, and for
`pinot.server.instance.segmentTarDir` “Directory to hold temporary segments
downloaded from Controller or Deep Store”. But based on the above, it’s not “all
the data”, and it’s not (really) “temporary segments”.  
**@dlavoie:** The definition of temporary can be loose maybe? :smile:  
**@dlavoie:** If you need to rebuild the segment indexes, there’s value in
having the tgz persisted.  
**@dlavoie:** If the definition of `all the data` is what is on the query
path, it is accurate :stuck_out_tongue:  
**@ken:** But if only indexes go into `pinot.server.instance.dataDir`, then
you’d need to access the tgz to get data in a column that doesn’t have an
index on it.  
**@dlavoie:** @fx19880617 to the rescue for that last one
:slightly_smiling_face:  
**@ken:** :slightly_smiling_face: I’ll see what he says when I’m back online
after dinner…  
**@ken:** Thanks again  
**@dlavoie:** my pleasure!  

###  _#random_

  
 **@hongtaozhang:** @hongtaozhang has joined the channel  
 **@krishnalumia535:** @krishnalumia535 has joined the channel  
 **@kaustabhganguly:** @kaustabhganguly has joined the channel  
 **@anusha.munukuntla:** @anusha.munukuntla has joined the channel  
 **@kylebanker:** @kylebanker has joined the channel  

###  _#troubleshooting_

  
 **@hongtaozhang:** @hongtaozhang has joined the channel  
 **@chxing:** Hi All. If we build a Pinot cluster without deep storage, the
controller will store all the segments on the controller disk configured in
`controller.data.dir`. Are there methods to delete controller segments based
on retention times, since we don’t have enough disk space to store them on the
controller?  
**@npawar:** In the table config, you can set retentionTimeValue and
retentionTimeUnit. Check out the table config documentation  
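(For example, retention is set in the `segmentsConfig` section of the table config; the time column name and 30-day value here are just illustrative.)

```json
"segmentsConfig": {
  "timeColumnName": "timestamp",
  "retentionTimeUnit": "DAYS",
  "retentionTimeValue": "30"
}
```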
**@npawar:**  
**@chxing:** Thx for your reply, but I am confused: if we keep the table
retentionTimeValue very long, like 1 year, all the segments still need to be
stored on the controller, which will fill the controller node quickly  
**@npawar:** Typically, you would attach an NFS to the controllers for this.  
**@npawar:** @fx19880617 any other suggestions from your experience ? ^^  
 **@krishnalumia535:** @krishnalumia535 has joined the channel  
 **@pedro.cls93:** Hi guys, is there a safeguard when applying ingestion
transformations if the input field is the default value? I.e., given this
transformation:
```json
{
  "columnName": "dateOfBirthMs",
  "transformFunction": "fromDateTime(dateOfBirth, 'yyyy-MM-dd''T''HH:mm:ss''Z')"
}
```
and these schema definitions:
```json
"dimensionFieldSpecs": [
  ...,
  { "name": "dateOfBirth", "dataType": "STRING" },
  ...
],
"dateTimeFieldSpecs": [
  ...,
  {
    "name": "dateOfBirthMs",
    "dataType": "LONG",
    "format": "1:MILLISECONDS:EPOCH",
    "granularity": "1:MILLISECONDS"
  }
],
```
I get this exception:
```
java.lang.IllegalStateException: Caught exception while invoking method:
public static long org.apache.pinot.common.function.scalar.DateTimeFunctions.fromDateTime(java.lang.String,java.lang.String)
with arguments: [null, yyyy-MM-dd'T'HH:mm:ss'Z]
```
I was under the impression that Pinot would not apply the transformation if the
input field is null, or that the transformation itself would be resilient. Is
there any way around this?  
**@mayanks:** Yes, transform functions should be able to handle nulls. As a
workaround, you can convert it into a Groovy function and add the null check
for now.  
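(A sketch of that workaround using Pinot's `Groovy({...}, args)` transform syntax; the `0L` fallback and the exact Groovy body are illustrative assumptions, not tested config.)

```json
{
  "columnName": "dateOfBirthMs",
  "transformFunction": "Groovy({dateOfBirth == null ? 0L : new java.text.SimpleDateFormat(\"yyyy-MM-dd'T'HH:mm:ss'Z'\").parse(dateOfBirth).getTime()}, dateOfBirth)"
}
```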
**@pedro.cls93:** A follow-up question.... How does Pinot decide if a field in
an incoming message is null to apply the defaultNullValue? Does the key of the
field have to be missing?  
**@pedro.cls93:** For a String field of name `fieldX` with default value
`"default"`, if an incoming message has the following payload: ```{ ...,
"fieldX": null, ..., }``` What is the expected value in Pinot? `null` or
`"default"`  
 **@kaustabhganguly:** @kaustabhganguly has joined the channel  
 **@anusha.munukuntla:** @anusha.munukuntla has joined the channel  
 **@kylebanker:** @kylebanker has joined the channel  

###  _#pinot-dev_

  
 **@anusha.munukuntla:** @anusha.munukuntla has joined the channel  

###  _#getting-started_

  
 **@kmvb.tau:** @kmvb.tau has joined the channel  

###  _#fix_llc_segment_upload_

  
 **@changliu:** Hi @ssubrama, @tingchen, I refactored the ZK access when we
get the list of segments for upload retry. Would you mind taking another look?
1\. During the commit phase, enqueue the segment without a download URL. 2\.
`uploadToSegmentStoreIfMissing` reads the in-memory list of segments to fix, and only
removes a segment from the in-memory list after a successful upload. 3\. Only
`prefetchLLCSegmentsWithoutDeepStoreCopy` has access to ZK, which is triggered
when the periodic jobs are set up. Leadership is also checked in this step. And to
further reduce the ZK access, a filter based on segment creation time was
added.  
**@changliu:** @ssubrama your concern about shared constant for time limit
check is addressed. I added another constant for it:
`MIN_TIME_BEFORE_FIXING_SEGMENT_STORE_COPY_MILLIS`  
 **@ssubrama:** I will look at it in the next couple of days. Thanks for
addressing all comments  
 **@changliu:** Thanks Subbu  