Posted to dev@pinot.apache.org by Pinot Slack Email Digest <sn...@apache.org> on 2021/03/06 02:00:13 UTC

Apache Pinot Daily Email Digest (2021-03-05)

### _#general_

  
 **@humengyuk18:** Does Pinot support changing the name of an existing column in a schema? I
tried changing a column name, but got the following exception on query: ```[ {
"errorCode": 500, "message": "MergeResponseError:\nData schema mismatch
between merged block:
[time_to_hour(LONG),age_decade(STRING),age_level(STRING),city(STRING),company_id(STRING),company_name(STRING),count_impression(LONG),count_in(LONG),count_passby(LONG),create_time(LONG),day(STRING),day_in_week(STRING),district(STRING),gate_id(STRING),gender(STRING),holiday_id(STRING),holiday_name(STRING),hour(STRING),is_holiday(STRING),month(STRING),province(STRING),region(STRING),shop_id(STRING),shop_name(STRING),temperature(STRING),temperature_id(STRING),total_duration(LONG),total_impression_duration(LONG),weather_cate_id(STRING),weather_cate_name(STRING),year(STRING)]
and block to merge:
[time_to_hour(LONG),age_decade(STRING),age_level(STRING),city(STRING),company_id(STRING),company_name(STRING),count_impression(LONG),count_in(LONG),count_passby(LONG),create_time(LONG),day(STRING),day_in_week(STRING),district(STRING),gate_id(STRING),gender(STRING),holiday_id(STRING),holiday_name(STRING),hour(STRING),is_holiday(STRING),month(STRING),province(STRING),region(STRING),shop_id(STRING),shop_name(STRING),temperature(STRING),temperature_id(STRING),total_duration(LONG),total_impression_duraion(LONG),weather_cate_id(STRING),weather_cate_name(STRING),year(STRING)],
drop block to merge" } ]```  
**@mayanks:** Hello, schema evolution is supported as long as it is backward
compatible. Changing a column name or type is considered backward
incompatible, and is not supported.  
**@humengyuk18:** Thanks, so in this case, I should delete all the segments and
re-ingest all the data?  
**@mayanks:** Yes, for an incompatible schema change, that is the option.  
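
The two column lists in the error above differ in a single name (`total_impression_duration` versus `total_impression_duraion`), which is hard to spot by eye. The following is a minimal sketch, not part of Pinot, that diffs two such lists. The helper names `parse_columns` and `schema_diff` are made up for illustration, and the lists are shortened to three columns taken from the error message.

```python
def parse_columns(schema_text):
    """Turn '[a(LONG),b(STRING)]' into a list of (name, type) pairs."""
    body = schema_text.strip().strip("[]")
    columns = []
    for item in body.split(","):
        name, _, col_type = item.strip().partition("(")
        columns.append((name, col_type.rstrip(")")))
    return columns


def schema_diff(merged, to_merge):
    """Return the columns found only in the first and only in the second list."""
    a, b = set(parse_columns(merged)), set(parse_columns(to_merge))
    return sorted(a - b), sorted(b - a)


# Shortened versions of the two lists from the error message.
merged_block = "[time_to_hour(LONG),total_duration(LONG),total_impression_duration(LONG)]"
block_to_merge = "[time_to_hour(LONG),total_duration(LONG),total_impression_duraion(LONG)]"

only_merged, only_to_merge = schema_diff(merged_block, block_to_merge)
print("only in merged block:  ", only_merged)
print("only in block to merge:", only_to_merge)
```

Any non-empty result means some segments were built with a different column name or type than others, which is the backward-incompatible case described above.
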
**@pankaj:** If we extend a table schema in Pinot to add new columns (so it
does not break backward compatibility), do we have to backfill data, or can
Pinot use null/default values to handle the older segments?  
**@mayanks:** Pinot can auto-fill null/default values in this case.  
**@npawar:** Pinot can also fill in a derived value, i.e. if the value of the new column
is derived from existing columns, Pinot will calculate it using the function
you provide.  
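
As a rough illustration of the derived-value case, the sketch below builds a table config fragment with an ingestion transform. It assumes the function is supplied through `ingestionConfig.transformConfigs` in the table config. The table name `my_table`, the new column `create_time_hours`, the source column `create_time` (taken to hold epoch milliseconds), and the helper `with_derived_column` are all hypothetical. The new column would also have to be added to the schema, which is not shown here.

```python
import json


def with_derived_column(table_config, column_name, transform_function):
    """Return a copy of table_config with one extra ingestion transform."""
    updated = json.loads(json.dumps(table_config))  # cheap deep copy
    ingestion = updated.setdefault("ingestionConfig", {})
    transforms = ingestion.setdefault("transformConfigs", [])
    transforms.append(
        {"columnName": column_name, "transformFunction": transform_function}
    )
    return updated


# Hypothetical table and columns, for illustration only.
table_config = {"tableName": "my_table", "tableType": "OFFLINE"}
updated = with_derived_column(
    table_config, "create_time_hours", "toEpochHours(create_time)"
)
print(json.dumps(updated, indent=2))
```
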
 **@1705ayush:** *How do I ingest data into Pinot on Kubernetes using native
batch ingestion?* Hi, I am trying to ingest CSV data into Pinot deployed on
Kubernetes using the LaunchDataIngestionJob arg. I have verified that the table
has been created in Pinot and that the job spec and CSV data are present on the
node. This is my Kubernetes job file:
```
apiVersion: batch/v1
kind: Job
metadata:
  name: pinot-case-offline-ingestion
  namespace: my-pinot-kube
spec:
  template:
    spec:
      containers:
        - name: pinot-load-case-offline
          image: apachepinot/pinot:0.3.0-SNAPSHOT
          args: ["LaunchDataIngestionJob", "-jobSpecFile", "/opt/data/table-configs/case_history/job-spec.yml"]
          volumeMounts:
            - name: mount-data
              mountPath: /opt/data
      restartPolicy: OnFailure
      volumes:
        - name: mount-data
          hostPath:
            path: /opt/data
  backoffLimit: 100
```
After applying this job to the node, nothing happens, and this is the log of
the pod. ```SegmentGenerationJobSpec:
!!org.apache.pinot.spi.ingestion.batch.spec.SegmentGenerationJobSpec
excludeFileNamePattern: null executionFrameworkSpec: {extraConfigs: null,
name: standalone, segmentGenerationJobRunnerClassName:
org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner,
segmentTarPushJobRunnerClassName:
org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner,
segmentUriPushJobRunnerClassName:
org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner}
includeFileNamePattern: glob:**/*.csv inputDirURI:
/opt/data/csv_data/case_prod_data jobType: SegmentCreationAndTarPush
outputDirURI: /pinot-segments/case_history overwriteOutput: true
pinotClusterSpecs: \- {controllerURI: ''} pinotFSSpecs: \- {className:
org.apache.pinot.spi.filesystem.LocalPinotFS, configs: null, scheme: file}
pushJobSpec: null recordReaderSpec: className:
org.apache.pinot.plugin.inputformat.csv.CSVRecordReader configClassName:
org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig configs:
{delimiter: '|', multiValueDelimiter: ''} dataFormat: csv
segmentNameGeneratorSpec: configs: {segment.name.prefix: case_history,
exclude.sequence.id: 'true'} type: normalizedDate tableSpec: {schemaURI: null,
tableConfigURI: null, tableName: case_history} Trying to create instance for
class
org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner
Initializing PinotFS for scheme file, classname
org.apache.pinot.spi.filesystem.LocalPinotFS``` Am I ingesting the data
incorrectly?  
**@fx19880617:** I think you are missing pushJobSpec?  
**@fx19880617:** ```pushJobSpec: null```  
**@1705ayush:** Hi @fx19880617, thank you for helping. I tried adding
pushJobSpec to the job spec:
```
pushJobSpec:
  pushParallelism: 2
  pushAttempts: 2
  pushRetryIntervalMillis: 1000
```
But the job completes with no errors.
And the pod log is ```SegmentGenerationJobSpec:
!!org.apache.pinot.spi.ingestion.batch.spec.SegmentGenerationJobSpec
excludeFileNamePattern: null executionFrameworkSpec: {extraConfigs: null,
name: standalone, segmentGenerationJobRunnerClassName:
org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner,
segmentTarPushJobRunnerClassName:
org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner,
segmentUriPushJobRunnerClassName:
org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner}
includeFileNamePattern: glob:**/*.csv inputDirURI:
/opt/data/csv_data/case_prod_data jobType: SegmentCreationAndTarPush
outputDirURI: /pinot-segments/case_history overwriteOutput: true
pinotClusterSpecs: \- {controllerURI: ''} pinotFSSpecs: \- {className:
org.apache.pinot.spi.filesystem.LocalPinotFS, configs: null, scheme: file}
pushJobSpec: {pushAttempts: 2, pushParallelism: 2, pushRetryIntervalMillis:
1000, segmentUriPrefix: null, segmentUriSuffix: null} recordReaderSpec:
className: org.apache.pinot.plugin.inputformat.csv.CSVRecordReader
configClassName: org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig
configs: {delimiter: '|', multiValueDelimiter: ''} dataFormat: csv
segmentNameGeneratorSpec: configs: {segment.name.prefix: case_history,
exclude.sequence.id: 'true'} type: normalizedDate tableSpec: {schemaURI: null,
tableConfigURI: null, tableName: case_history} Trying to create instance for
class
org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner
Initializing PinotFS for scheme file, classname
org.apache.pinot.spi.filesystem.LocalPinotFS```  
**@fx19880617:** ok  
**@fx19880617:** what are the logs for the job?  
**@1705ayush:** Here is the log of the job:
```
16:26:48:ayush@:pinot :alien: kubectl -n my-pinot-kube describe jobs.batch pinot-case-offline-ingestion
Name:           pinot-case-offline-ingestion
Namespace:      my-pinot-kube
Selector:       controller-uid=25b4e843-b600-4de2-a2ad-584ac8ce17b5
Labels:         controller-uid=25b4e843-b600-4de2-a2ad-584ac8ce17b5
                job-name=pinot-case-offline-ingestion
Annotations:    <none>
Parallelism:    1
Completions:    1
Start Time:     Fri, 05 Mar 2021 16:26:41 -0500
Completed At:   Fri, 05 Mar 2021 16:26:44 -0500
Duration:       3s
Pods Statuses:  0 Running / 1 Succeeded / 0 Failed
Pod Template:
  Labels:  controller-uid=25b4e843-b600-4de2-a2ad-584ac8ce17b5
           job-name=pinot-case-offline-ingestion
  Containers:
   pinot-load-case-offline:
    Image:      apachepinot/pinot:0.3.0-SNAPSHOT
    Port:       <none>
    Host Port:  <none>
    Args:
      LaunchDataIngestionJob
      -jobSpecFile
      /opt/data/table-configs/case_history/job-spec.yml
    Environment:  <none>
    Mounts:
      /opt/data from mount-data (rw)
  Volumes:
   mount-data:
    Type:          HostPath (bare host directory volume)
    Path:          /opt/data
    HostPathType:
Events:
  Type    Reason            Age   From            Message
  ----    ------            ----  ----            -------
  Normal  SuccessfulCreate  27s   job-controller  Created pod: pinot-case-offline-ingestion-mfvrx
  Normal  Completed         24s   job-controller  Job completed
```
The following is the job spec file to
refer. What should the value of pinotClusterSpecs.controllerURI be? I tried
changing it to gibberish and got the same logs. I think my value
of pinotClusterSpecs.controllerURI is incorrect.
```
executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner'
jobType: SegmentCreationAndTarPush
inputDirURI: '/opt/data/csv_data/case_prod_data'
includeFileNamePattern: 'glob:**/*.csv'
outputDirURI: '/pinot-segments/case_history'
overwriteOutput: true
pinotFSSpecs:
  - scheme: file
    className: org.apache.pinot.spi.filesystem.LocalPinotFS
recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
  configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
  configs:
    delimiter: '|'
    multiValueDelimiter: ''
tableSpec:
  tableName: 'case_history'
pinotClusterSpecs:
  # - controllerURI: 'pinot-controller:9000'
  - controllerURI: ''
segmentNameGeneratorSpec:
  type: normalizedDate
  configs:
    segment.name.prefix: 'case_history'
    exclude.sequence.id: true
pushJobSpec:
  pushParallelism: 2
  pushAttempts: 2
  pushRetryIntervalMillis: 1000
```
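
Two fields have come up as suspects so far: a null `pushJobSpec` and an empty `pinotClusterSpecs.controllerURI`. The sketch below is a hypothetical pre-flight check, not a Pinot tool, that flags both. It assumes the job spec has already been loaded into a plain dict (the YAML loading step is left out), and the function name `preflight_warnings` is made up. The thread does not establish that either field caused the silent run.

```python
def preflight_warnings(job_spec):
    """Return human-readable warnings for a job spec held in a dict."""
    warnings = []
    job_type = job_spec.get("jobType") or ""
    pushes = "Push" in job_type  # e.g. SegmentCreationAndTarPush

    if pushes and not job_spec.get("pushJobSpec"):
        warnings.append("jobType %s includes a push, but pushJobSpec is missing" % job_type)

    cluster_specs = job_spec.get("pinotClusterSpecs") or []
    if pushes and not cluster_specs:
        warnings.append("pinotClusterSpecs has no entries")
    for index, cluster in enumerate(cluster_specs):
        if not (cluster.get("controllerURI") or "").strip():
            warnings.append("pinotClusterSpecs[%d].controllerURI is empty" % index)
    return warnings


# The relevant fields as they appear in the first pod log of this thread.
spec_from_thread = {
    "jobType": "SegmentCreationAndTarPush",
    "pinotClusterSpecs": [{"controllerURI": ""}],
    "pushJobSpec": None,
}
for warning in preflight_warnings(spec_from_thread):
    print("WARNING:", warning)
```
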
**@fx19880617:** then is there data in `/opt/data/csv_data/case_prod_data`?  
**@1705ayush:** yes. I checked by running an Ubuntu container and opening a shell in
it. There is data present at this path.  
**@fx19880617:** can you try a newer image as well  
**@fx19880617:** ```apachepinot/pinot:0.6.0```  
**@fx19880617:** 0.3.0 is a very old image, and I cannot recall its details  
**@1705ayush:** ok. so changed the image. it worked. at the very end of the
log it says ```Response for pushing table case_history segment case_history to
location  \- 200: {"status":"Successfully uploaded segment: case_history of
table: case_history"}```  
**@1705ayush:** But I am wondering why I cannot query it in the Pinot query UI  
**@1705ayush:** no records are returned from the query `select * from
case_history limit 10`  
**@fx19880617:** hmm  
**@fx19880617:** it should be  
**@1705ayush:** seems like another issue that I have to look into. But
anyway, thank you very much @fx19880617 for your prompt responses and help. The
new image worked out well.  
**@fx19880617:** can you check pinot server log?  
**@fx19880617:** seems like so  
**@1705ayush:** ok. I do see some errors on pinot-server.  
**@1705ayush:** ```2021/03/05 20:45:00.943 INFO [HelixServerStarter] [Start a
Pinot [SERVER]] Starting Pinot server 2021/03/05 20:45:00.944 INFO
[HelixServerStarter] [Start a Pinot [SERVER]] Initializing Helix manager with
zkAddress: pinot-zookeeper:2181, clusterName: pinot-quickstart, instanceId:
Server_pinot-server-0.pinot-server-headless.my-pinot-
kube.svc.cluster.local_8098 2021/03/05 20:45:02.560 INFO [HelixServerStarter]
[Start a Pinot [SERVER]] Initializing server instance and registering state
model factory 2021/03/05 20:45:51.252 INFO [HelixServerStarter] [Start a Pinot
[SERVER]] Connecting Helix manager 2021/03/05 20:46:42.537 WARN [ClientCnxn]
[Start a Pinot [SERVER]-SendThread(pinot-zookeeper:2181)] Client session timed
out, have not heard from server in 31084ms for sessionid 0x0 2021/03/05
20:46:44.353 WARN [ParticipantHealthReportTask] [Start a Pinot [SERVER]]
ParticipantHealthReportTimerTask already stopped 2021/03/05 20:47:10.343 WARN
[CallbackHandler] [Start a Pinot [SERVER]] Callback handler received event in
wrong order. Listener:
org.apache.helix.messaging.handling.HelixTaskExecutor@2767bcd8, path: /pinot-
quickstart/INSTANCES/Server_pinot-server-0.pinot-server-headless.my-pinot-
kube.svc.cluster.local_8098/MESSAGES, expected types: [CALLBACK, FINALIZE] but
was INIT 2021/03/05 20:47:11.245 INFO [HelixServerStarter] [Start a Pinot
[SERVER]] Instance config for instance: Server_pinot-server-0.pinot-server-
headless.my-pinot-kube.svc.cluster.local_8098 has instance tags:
[DefaultTenant_OFFLINE, DefaultTenant_REALTIME], host: pinot-server-0.pinot-
server-headless.my-pinot-kube.svc.cluster.local, port: 8098, no need to update
2021/03/05 20:47:11.249 INFO [HelixServerStarter] [Start a Pinot [SERVER]]
Using class: org.apache.pinot.server.api.access.AllowAllAccessFactory as the
AccessControlFactory 2021/03/05 20:47:11.455 INFO [HelixServerStarter] [Start
a Pinot [SERVER]] Starting server admin application on:  2021/03/05
20:47:13.650 WARN [ClientCnxn] [Start a Pinot [SERVER]-SendThread(pinot-
zookeeper:2181)] Session 0x10001285ff10004 for server pinot-
zookeeper/10.107.87.233:2181, unexpected error, closing socket connection and
attempting reconnect java.io.IOException: Connection reset by peer at
sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[?:1.8.0_282] at
sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) ~[?:1.8.0_282] at
sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) ~[?:1.8.0_282] at
sun.nio.ch.IOUtil.read(IOUtil.java:192) ~[?:1.8.0_282] at
sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379) ~[?:1.8.0_282]
at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:75)
~[pinot-all-0.7.0-SNAPSHOT-jar-with-
dependencies.jar:0.7.0-SNAPSHOT-b2d716d9c465eaf69685f8e284015de5cd7b038e] at
org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:363)
~[pinot-all-0.7.0-SNAPSHOT-jar-with-
dependencies.jar:0.7.0-SNAPSHOT-b2d716d9c465eaf69685f8e284015de5cd7b038e] at
org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1223) [pinot-
all-0.7.0-SNAPSHOT-jar-with-
dependencies.jar:0.7.0-SNAPSHOT-b2d716d9c465eaf69685f8e284015de5cd7b038e]
2021/03/05 20:47:46.344 WARN [ZKHelixManager] [ZkClient-EventThread-16-pinot-
zookeeper:2181] KeeperState:Disconnected, SessionId: 10001285ff10004,
instance: Server_pinot-server-0.pinot-server-headless.my-pinot-
kube.svc.cluster.local_8098, type: PARTICIPANT Mar 05, 2021 8:48:39 PM
org.glassfish.grizzly.http.server.NetworkListener start INFO: Started listener
bound to [0.0.0.0:8097] Mar 05, 2021 8:48:40 PM
org.glassfish.grizzly.http.server.HttpServer start INFO: [HttpServer] Started.
2021/03/05 20:48:41.841 WARN [ZKHelixManager] [ZkClient-EventThread-16-pinot-
zookeeper:2181] KeeperState:Disconnected, SessionId: 10001285ff10004,
instance: Server_pinot-server-0.pinot-server-headless.my-pinot-
kube.svc.cluster.local_8098, type: PARTICIPANT 2021/03/05 20:50:17.063 WARN
[ZKHelixManager] [ZkClient-EventThread-16-pinot-zookeeper:2181]
KeeperState:Disconnected, SessionId: 10001285ff10004, instance: Server_pinot-
server-0.pinot-server-headless.my-pinot-kube.svc.cluster.local_8098, type:
PARTICIPANT 2021/03/05 20:51:06.653 ERROR [StartServiceManagerCommand] [Start
a Pinot [SERVER]] Failed to start a Pinot [SERVER] at 368.2 since launch
org.apache.helix.HelixException: fail to set config. cluster: pinot-quickstart
is NOT setup. at org.apache.helix.ConfigAccessor.set(ConfigAccessor.java:300)
~[pinot-all-0.7.0-SNAPSHOT-jar-with-
dependencies.jar:0.7.0-SNAPSHOT-b2d716d9c465eaf69685f8e284015de5cd7b038e] at
org.apache.helix.manager.zk.ZKHelixAdmin.setConfig(ZKHelixAdmin.java:1092)
~[pinot-all-0.7.0-SNAPSHOT-jar-with-
dependencies.jar:0.7.0-SNAPSHOT-b2d716d9c465eaf69685f8e284015de5cd7b038e] at
org.apache.pinot.server.starter.helix.HelixServerStarter.start(HelixServerStarter.java:361)
~[pinot-all-0.7.0-SNAPSHOT-jar-with-
dependencies.jar:0.7.0-SNAPSHOT-b2d716d9c465eaf69685f8e284015de5cd7b038e] at
org.apache.pinot.tools.service.PinotServiceManager.startServer(PinotServiceManager.java:150)
~[pinot-all-0.7.0-SNAPSHOT-jar-with-
dependencies.jar:0.7.0-SNAPSHOT-b2d716d9c465eaf69685f8e284015de5cd7b038e] at
org.apache.pinot.tools.service.PinotServiceManager.startRole(PinotServiceManager.java:95)
~[pinot-all-0.7.0-SNAPSHOT-jar-with-
dependencies.jar:0.7.0-SNAPSHOT-b2d716d9c465eaf69685f8e284015de5cd7b038e] at
org.apache.pinot.tools.admin.command.StartServiceManagerCommand$1.lambda$run$0(StartServiceManagerCommand.java:260)
~[pinot-all-0.7.0-SNAPSHOT-jar-with-
dependencies.jar:0.7.0-SNAPSHOT-b2d716d9c465eaf69685f8e284015de5cd7b038e] at
org.apache.pinot.tools.admin.command.StartServiceManagerCommand.startPinotService(StartServiceManagerCommand.java:286)
[pinot-all-0.7.0-SNAPSHOT-jar-with-
dependencies.jar:0.7.0-SNAPSHOT-b2d716d9c465eaf69685f8e284015de5cd7b038e] at
org.apache.pinot.tools.admin.command.StartServiceManagerCommand.access$000(StartServiceManagerCommand.java:57)
[pinot-all-0.7.0-SNAPSHOT-jar-with-
dependencies.jar:0.7.0-SNAPSHOT-b2d716d9c465eaf69685f8e284015de5cd7b038e] at
org.apache.pinot.tools.admin.command.StartServiceManagerCommand$1.run(StartServiceManagerCommand.java:260)
[pinot-all-0.7.0-SNAPSHOT-jar-with-
dependencies.jar:0.7.0-SNAPSHOT-b2d716d9c465eaf69685f8e284015de5cd7b038e]
2021/03/05 21:37:47.170 WARN [ConfigAccessor] [ZkClient-EventThread-16-pinot-
zookeeper:2181] No config found at /pinot-
quickstart/CONFIGS/RESOURCE/case_history_OFFLINE```  
**@1705ayush:** I don't know why it is looking for pinot-quickstart configs  
**@fx19880617:** hmm when you start pinot server did you give a clustername?  
**@1705ayush:** I start pinot using helm like
```
kubectl create ns my-pinot-kube
helm install pinot /home/ayush/spyne/incubator-pinot/kubernetes/helm/pinot -n my-pinot-kube --set replicas=1
```
**@fx19880617:** hmmm  
**@fx19880617:** can you describe the statefulsets of pinot-controller and
pinot-server and see what the arguments are?  
**@1705ayush:** ok. All the pinot workers are in running state. I do see these
2 errors on pinot-controller
```
WARN [PinotInstanceRestletResource] [grizzly-http-server-1] Admin port is not set for instance: Server_pinot-server-0.pinot-server-headless.my-pinot-kube.svc.cluster.local_8098 ... ...
```
```
WARN [PinotInstanceRestletResource] [grizzly-http-server-1] Grpc port is not set for instance: Controller_pinot-controller-0.pinot-controller-headless.my-pinot-kube.svc.cluster.local_9000 ... ...
```
**@1705ayush:** or, I think this could mean something (log on pinot-controller)
```
WARN [SegmentStatusChecker] [pool-7-thread-2] Table case_history_OFFLINE has 1 segments with no online replicas
WARN [SegmentStatusChecker] [pool-7-thread-2] Table case_history_OFFLINE has 0 replicas, below replication threshold :1
```
**@fx19880617:** this means your controller is up, but no pinot server is
connected to the cluster  
**@fx19880617:** i feel something goes wrong with the server setup  
**@fx19880617:** can you try to restart pinot-server pod and see if it's
reconnecting?  
**@1705ayush:** yes. restarting the node  
**@1705ayush:** yes. restarting the node worked out! Thank you very much
@fx19880617. :pray:  
**@fx19880617:** cool!  
**@fx19880617:** I think the issue is that the pinot server pod started before
the pinot controller, which is needed to set up the zookeeper structure  
**@fx19880617:** so restart should fix it  
**@1705ayush:** yes. whenever I start using helm, zookeeper and controller are
the last ones to start, and because of that the server and broker take multiple
restarts.  
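
The start-order problem described above (server and broker coming up before the controller has set up the cluster in ZooKeeper) is commonly handled by polling a readiness check before a dependent component starts. The following is a minimal sketch of that pattern, not something taken from the Pinot helm chart. The names `controller_is_healthy` and `wait_until_ready` are made up, and the check assumes the controller answers `GET /health` with HTTP 200 when it is up. The demo at the bottom uses a fake check so it runs without a cluster.

```python
import http.client
import time
import urllib.request


def controller_is_healthy(base_url, timeout_seconds=5.0):
    """Return True if GET <base_url>/health answers with HTTP 200 (assumed endpoint)."""
    try:
        with urllib.request.urlopen(base_url + "/health", timeout=timeout_seconds) as response:
            return response.status == 200
    except (OSError, ValueError, http.client.HTTPException):
        # Connection refused, DNS failure, HTTP error status, malformed URL, and so on.
        return False


def wait_until_ready(is_ready, attempts=30, delay_seconds=2.0, sleep=time.sleep):
    """Call is_ready() until it returns True; return the number of calls made."""
    for attempt in range(1, attempts + 1):
        if is_ready():
            return attempt
        if attempt < attempts:
            sleep(delay_seconds)
    raise TimeoutError("not ready after %d attempts" % attempts)


# Demo with a fake check that succeeds on the third call.
fake_results = iter([False, False, True])
attempts_used = wait_until_ready(lambda: next(fake_results), delay_seconds=0.0)
print("ready after", attempts_used, "checks")
```

Against a real cluster the call would look like `wait_until_ready(lambda: controller_is_healthy("http://pinot-controller:9000"))`, where the host and port are an assumption based on the commented-out value in the job spec earlier in this thread.
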

###  _#segment-write-api_

  
 **@npawar:** Hey @yupeng , here's the branch i'm working on:  
**@npawar:** i have a basic no-frills file based impl in there. everything is
sync and single threaded at the moment.  
 **@npawar:** But should be good enough if you want to start trying it out in
your flink connector POC  
 **@npawar:** if you do try it, lmk if you have any feedback  
 **@yupeng:** thanks!  
 **@yupeng:** that’s fast  