Posted to commits@pinot.apache.org by "cbalci (via GitHub)" <gi...@apache.org> on 2023/03/08 01:29:12 UTC

[GitHub] [pinot] cbalci opened a new pull request, #10394: Pinot Spark Connector for Spark3

cbalci opened a new pull request, #10394:
URL: https://github.com/apache/pinot/pull/10394

   **Background**
   Apache Spark has [changed](https://blog.madhukaraphatak.com/spark-3-datasource-v2-part-3) the DataSource interface significantly between Spark 2 and Spark 3, so the current pinot-spark-connector doesn't work with Spark 3. In a previous PR (#10321) I refactored the spark connector into two modules (`pinot-spark-common` and `pinot-spark-2-connector`) so that shared logic can be reused, which gives us a clean base for implementing the new version.
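   
   For reference, the Spark 3 entry point has roughly the following shape (a minimal sketch based on the standard Spark 3 `TableProvider` API, not the connector's actual code):
   
   ```scala
   import java.util
   
   import org.apache.spark.sql.connector.catalog.{Table, TableProvider}
   import org.apache.spark.sql.connector.expressions.Transform
   import org.apache.spark.sql.types.StructType
   import org.apache.spark.sql.util.CaseInsensitiveStringMap
   
   // In Spark 3, a data source implements TableProvider: it infers (or accepts)
   // a schema and returns a Table whose ScanBuilder later plans the input
   // partitions. This replaces the Spark 2 DataSourceV2/DataSourceReader API.
   class ExampleTableProvider extends TableProvider {
     override def inferSchema(options: CaseInsensitiveStringMap): StructType = ???
     override def getTable(
         schema: StructType,
         partitioning: Array[Transform],
         properties: util.Map[String, String]): Table = ???
   }
   ```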
   
   **Change**
   In this PR I'm implementing the DataSourceV2 interface as published in Spark 3. Functionality is exactly the same as the Pinot Spark 2 connector, and it supports all existing configuration options, such as:
   - Ability to read from REALTIME, OFFLINE or HYBRID Pinot tables
   - Ability to scan using HTTP or gRPC server endpoints
   - Column pruning and filter push-down
   - etc. (see docs)
   
   
   It can be used as a drop-in replacement when migrating from Spark 2 to Spark 3. Spark 3 also brings some new features and improvements, such as aggregation push-down, which can be taken advantage of in the future.
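   
   Usage stays the same as with the Spark 2 connector; for example (a sketch using the documented options; the table name and the gRPC flag here are illustrative):
   
   ```scala
   // Read an offline Pinot table; the filter and column selection below
   // are pushed down to the Pinot servers when push-down is enabled.
   val df = spark.read
     .format("pinot")
     .option("table", "airlineStats")
     .option("tableType", "offline")
     .option("useGrpcServer", "true") // omit to use the HTTP server endpoints
     .load()
     .filter($"Origin" === "ORD")
     .select($"Carrier", $"Origin")
   ```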
   
   **Testing**
   I added basic unit test coverage as well as a good set of integration tests under `ExampleSparkPinotConnectorTest`, similar to the Spark 2 connector.
   
   `feature`
   `release-notes` (Added Spark3 support for Pinot Spark Connector)
   




[GitHub] [pinot] xiangfu0 commented on pull request #10394: Pinot Spark Connector for Spark3

Posted by "xiangfu0 (via GitHub)" <gi...@apache.org>.
xiangfu0 commented on PR #10394:
URL: https://github.com/apache/pinot/pull/10394#issuecomment-1474916032

   LGTM! Thanks for your contribution, @cbalci!




[GitHub] [pinot] cbalci commented on a diff in pull request #10394: Pinot Spark Connector for Spark3

Posted by "cbalci (via GitHub)" <gi...@apache.org>.
cbalci commented on code in PR #10394:
URL: https://github.com/apache/pinot/pull/10394#discussion_r1131850729


##########
pinot-connectors/pinot-spark-3-connector/src/test/resources/schema/spark-schema.json:
##########
@@ -0,0 +1,105 @@
+{
+  "type" : "struct",
+  "fields" : [ {
+    "name" : "floatMetric",
+    "type" : "float",
+    "nullable" : true,
+    "metadata" : { }
+  }, {
+    "name" : "doubleMetric",
+    "type" : "double",
+    "nullable" : true,
+    "metadata" : { }
+  }, {
+    "name" : "longMetric",
+    "type" : "long",
+    "nullable" : true,
+    "metadata" : { }
+  }, {
+    "name" : "intMetric",
+    "type" : "integer",
+    "nullable" : true,
+    "metadata" : { }
+  }, {
+    "name" : "stringMetric",
+    "type" : "string",
+    "nullable" : true,
+    "metadata" : { }
+  }, {
+    "name" : "stringDim",
+    "type" : "string",
+    "nullable" : true,
+    "metadata" : { }
+  }, {
+    "name" : "intDim",
+    "type" : "integer",
+    "nullable" : true,
+    "metadata" : { }
+  }, {
+    "name" : "floatDim",
+    "type" : "float",
+    "nullable" : true,
+    "metadata" : { }
+  }, {
+    "name" : "doubleDim",
+    "type" : "double",
+    "nullable" : true,
+    "metadata" : { }
+  }, {
+    "name" : "longDim",
+    "type" : "long",
+    "nullable" : true,
+    "metadata" : { }
+  }, {
+    "name" : "boolDim",
+    "type" : "boolean",
+    "nullable" : true,
+    "metadata" : { }
+  }, {
+    "name" : "stringArrayDim",
+    "type" : {
+      "type" : "array",
+      "elementType" : "string",
+      "containsNull" : true
+    },
+    "nullable" : true,
+    "metadata" : { }
+  }, {
+    "name" : "floatArrayDim",
+    "type" : {
+      "type" : "array",
+      "elementType" : "float",
+      "containsNull" : true
+    },
+    "nullable" : true,
+    "metadata" : { }
+  }, {
+    "name" : "boolArrayDim",
+    "type" : {
+      "type" : "array",
+      "elementType" : "boolean",
+      "containsNull" : true
+    },
+    "nullable" : true,
+    "metadata" : { }
+  }, {
+    "name" : "byteDim",
+    "type" : {
+      "type" : "array",
+      "elementType" : "byte",
+      "containsNull" : true
+    },
+    "nullable" : true,
+    "metadata" : { }
+  }, {
+    "name" : "outgoingTimeField",
+    "type" : "integer",
+    "nullable" : true,
+    "metadata" : { }
+  }, {
+    "name" : "timestampField",
+    "type" : "long",
+    "nullable" : true,
+    "metadata" : { }
+  } ]
+}

Review Comment:
   Added.





[GitHub] [pinot] jackjlli commented on a diff in pull request #10394: Pinot Spark Connector for Spark3

Posted by "jackjlli (via GitHub)" <gi...@apache.org>.
jackjlli commented on code in PR #10394:
URL: https://github.com/apache/pinot/pull/10394#discussion_r1131646222


##########
pinot-connectors/pinot-spark-3-connector/src/test/resources/schema/pinot-schema.json:
##########
@@ -0,0 +1,75 @@
+{
+  "schemaName" : "schemaName",
+  "dimensionFieldSpecs" : [ {
+    "name" : "stringDim",
+    "dataType" : "STRING"
+  }, {
+    "name" : "intDim",
+    "dataType" : "INT"
+  }, {
+    "name" : "floatDim",
+    "dataType" : "FLOAT",
+    "defaultNullValue" : 0.0
+  }, {
+    "name" : "doubleDim",
+    "dataType" : "DOUBLE"
+  }, {
+    "name" : "longDim",
+    "dataType" : "LONG"
+  }, {
+    "name": "boolDim",
+    "dataType": "BOOLEAN"
+  }, {
+    "name" : "stringArrayDim",
+    "dataType" : "STRING",
+    "singleValueField" : false
+  }, {
+    "name" : "floatArrayDim",
+    "dataType" : "FLOAT",
+    "singleValueField" : false
+  }, {
+    "name": "boolArrayDim",
+    "dataType": "BOOLEAN",
+    "singleValueField": false
+  } ],
+  "metricFieldSpecs" : [ {
+    "name" : "floatMetric",
+    "dataType" : "FLOAT"
+  }, {
+    "name" : "doubleMetric",
+    "dataType" : "DOUBLE"
+  }, {
+    "name" : "longMetric",
+    "dataType" : "LONG"
+  }, {
+    "name" : "intMetric",
+    "dataType" : "INT",
+    "defaultNullValue" : 10
+  }, {
+    "name" : "stringMetric",
+    "dataType" : "STRING"
+  }, {
+    "name" : "byteDim",
+    "dataType" : "BYTES"
+  } ],
+  "timeFieldSpec" : {
+    "incomingGranularitySpec" : {
+      "name" : "incomingTimeField",
+      "dataType" : "LONG",
+      "timeType" : "SECONDS"
+    },
+    "outgoingGranularitySpec" : {
+      "name" : "outgoingTimeField",
+      "dataType" : "INT",
+      "timeType" : "DAYS"
+    }
+  },
+  "dateTimeFieldSpecs": [
+    {
+      "name": "timestampField",
+      "dataType": "TIMESTAMP",
+      "format": "1:MILLISECONDS:EPOCH",
+      "granularity": "1:SECONDS"
+    }
+  ]
+}

Review Comment:
   Missing empty line at the bottom.



##########
pinot-connectors/pinot-spark-3-connector/README.md:
##########
@@ -0,0 +1,69 @@
+<!--
+
+    Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements.  See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership.  The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License.  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+    Unless required by applicable law or agreed to in writing,
+    software distributed under the License is distributed on an
+    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+    KIND, either express or implied.  See the License for the
+    specific language governing permissions and limitations
+    under the License.
+
+-->
+# Spark-Pinot Connector
+
+Spark-pinot connector to read and write data from/to Pinot.

Review Comment:
   This PR is just for reading data, not writing, right? If that's the case, can we state it here?



##########
pinot-connectors/pinot-spark-3-connector/src/test/resources/schema/spark-schema.json:
##########
@@ -0,0 +1,105 @@
+{
+  "type" : "struct",
+  "fields" : [ {
+    "name" : "floatMetric",
+    "type" : "float",
+    "nullable" : true,
+    "metadata" : { }
+  }, {
+    "name" : "doubleMetric",
+    "type" : "double",
+    "nullable" : true,
+    "metadata" : { }
+  }, {
+    "name" : "longMetric",
+    "type" : "long",
+    "nullable" : true,
+    "metadata" : { }
+  }, {
+    "name" : "intMetric",
+    "type" : "integer",
+    "nullable" : true,
+    "metadata" : { }
+  }, {
+    "name" : "stringMetric",
+    "type" : "string",
+    "nullable" : true,
+    "metadata" : { }
+  }, {
+    "name" : "stringDim",
+    "type" : "string",
+    "nullable" : true,
+    "metadata" : { }
+  }, {
+    "name" : "intDim",
+    "type" : "integer",
+    "nullable" : true,
+    "metadata" : { }
+  }, {
+    "name" : "floatDim",
+    "type" : "float",
+    "nullable" : true,
+    "metadata" : { }
+  }, {
+    "name" : "doubleDim",
+    "type" : "double",
+    "nullable" : true,
+    "metadata" : { }
+  }, {
+    "name" : "longDim",
+    "type" : "long",
+    "nullable" : true,
+    "metadata" : { }
+  }, {
+    "name" : "boolDim",
+    "type" : "boolean",
+    "nullable" : true,
+    "metadata" : { }
+  }, {
+    "name" : "stringArrayDim",
+    "type" : {
+      "type" : "array",
+      "elementType" : "string",
+      "containsNull" : true
+    },
+    "nullable" : true,
+    "metadata" : { }
+  }, {
+    "name" : "floatArrayDim",
+    "type" : {
+      "type" : "array",
+      "elementType" : "float",
+      "containsNull" : true
+    },
+    "nullable" : true,
+    "metadata" : { }
+  }, {
+    "name" : "boolArrayDim",
+    "type" : {
+      "type" : "array",
+      "elementType" : "boolean",
+      "containsNull" : true
+    },
+    "nullable" : true,
+    "metadata" : { }
+  }, {
+    "name" : "byteDim",
+    "type" : {
+      "type" : "array",
+      "elementType" : "byte",
+      "containsNull" : true
+    },
+    "nullable" : true,
+    "metadata" : { }
+  }, {
+    "name" : "outgoingTimeField",
+    "type" : "integer",
+    "nullable" : true,
+    "metadata" : { }
+  }, {
+    "name" : "timestampField",
+    "type" : "long",
+    "nullable" : true,
+    "metadata" : { }
+  } ]
+}

Review Comment:
   Same here.





[GitHub] [pinot] cbalci commented on a diff in pull request #10394: Pinot Spark Connector for Spark3

Posted by "cbalci (via GitHub)" <gi...@apache.org>.
cbalci commented on code in PR #10394:
URL: https://github.com/apache/pinot/pull/10394#discussion_r1131848657


##########
pinot-connectors/pinot-spark-3-connector/documentation/read_model.md:
##########
@@ -0,0 +1,140 @@
+<!--
+
+    Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements.  See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership.  The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License.  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+    Unless required by applicable law or agreed to in writing,
+    software distributed under the License is distributed on an
+    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+    KIND, either express or implied.  See the License for the
+    specific language governing permissions and limitations
+    under the License.
+
+-->
+# Read Model
+
+The connector can scan offline, realtime and hybrid tables. The two base parameters, `table` and `tableType`, have to be given as below:
+- For an offline table: `table: tbl`, `tableType: OFFLINE or offline`
+- For a realtime table: `table: tbl`, `tableType: REALTIME or realtime`
+- For a hybrid table: `table: tbl`, `tableType: HYBRID or hybrid`
+
+An example scan:
+
+```scala
+val df = spark.read
+      .format("pinot")
+      .option("table", "airlineStats")
+      .option("tableType", "offline")
+      .load()
+```
+
+A custom schema can be specified directly. If no schema is specified, the connector reads the table schema from the Pinot controller and converts it to a Spark schema.
+
+### Architecture
+
+The connector reads data directly from the Pinot servers. It first creates a query with the given columns and filters (if filter push-down is enabled), then finds the routing table for that query. Based on the routing table and `segmentsPerSplit` (explained in detail below), it creates Pinot splits that contain **one Pinot server and one or more segments per Spark partition**. Finally, each partition reads data from its Pinot server in parallel.
+
+![Spark-Pinot Connector Architecture](images/spark-pinot-connector-executor-server-interaction.jpg)
+
+Each Spark partition opens a connection to its Pinot server and reads the data. For example, assume the routing table for a query looks like this:
+
+```
+- realtime ->
+   - realtimeServer1 -> (segment1, segment2, segment3)
+   - realtimeServer2 -> (segment4)
+- offline ->
+   - offlineServer10 -> (segment10, segment20)
+```
+
+If `segmentsPerSplit` is 3, three Spark partitions will be created as below:
+
+| Spark Partition  | Queried Pinot Server/Segments |
+| ------------- | ------------- |
+| partition1  | realtimeServer1 / segment1, segment2, segment3  |
+| partition2  | realtimeServer2 / segment4  |
+| partition3  | offlineServer10 / segment10, segment20 |
+
+If `segmentsPerSplit` is 1, six Spark partitions will be created:
+
+| Spark Partition  | Queried Pinot Server/Segments |
+| ------------- | ------------- |
+| partition1  | realtimeServer1 / segment1 |
+| partition2  | realtimeServer1 / segment2  |
+| partition3  | realtimeServer1 / segment3 |
+| partition4  | realtimeServer2 / segment4 |
+| partition5  | offlineServer10 / segment10 |
+| partition6  | offlineServer10 / segment20 |
+
+A low `segmentsPerSplit` value means more parallelism, but it also means many connections will be opened to the Pinot servers, which increases the QPS on them.
+
+A high `segmentsPerSplit` value means less parallelism; each Pinot server will scan more segments per request.
+
+**Note:** Pinot servers prune segments based on segment metadata when a query arrives. In some cases (for example, when filtering on certain columns), some servers may not return data, so some Spark partitions will be empty. In such cases, `repartition()` may be applied for efficient data analysis after loading the data into Spark.
+
+### Filter and Column Push-Down
+The connector supports filter and column push-down: filters and columns are pushed to the Pinot servers, which improves read performance by minimizing data transfer between Pinot and Spark. Filter push-down is enabled by default. If filters should instead be applied in Spark, set `usePushDownFilters` to `false`.
+
+The connector uses SQL, so all SQL filters are supported.
+
+### Segment Pruning
+
+The connector fetches the routing table for the generated query to determine which Pinot servers will be queried and which segments will be scanned. If partitioning is enabled for the Pinot table and the query created in Spark scans only specific partitions, only the required Pinot server and segment information is fetched; that is, segments are pruned before reading data, just as Pinot brokers do. For more information, see [Optimizing Scatter and Gather in Pinot](https://docs.pinot.apache.org/operators/operating-pinot/tuning/routing#optimizing-scatter-and-gather).
+
+### Table Querying
+The connector uses SQL to query Pinot tables.
+
+The connector creates realtime and offline queries based on the filters and required columns:
+- If the queried table type is `OFFLINE` or `REALTIME`, routing table information is fetched for that table type.
+- If the queried table type is `HYBRID`, both realtime and offline routing table information is fetched. The connector also fetches the table's `TimeBoundary` and uses it in the queries to ensure that the overlap between realtime and offline segment data is queried exactly once. For more information, see [Pinot Broker](https://docs.pinot.apache.org/basics/components/broker).
+
+### Query Generation
+
+Example generated queries for the usages below (assume the `airlineStats` table is hybrid and the TimeBoundary is `DaysSinceEpoch, 16084`):
+
+```scala
+val df = spark.read
+      .format("pinot")
+      .option("table", "airlineStats")
+      .option("tableType", "hybrid")
+      .load()
+```
+
+For the above usage, the following realtime and offline SQL queries will be created:
+
+- Offline query: `select * from airlineStats_OFFLINE where DaysSinceEpoch < 16084 LIMIT {Int.MaxValue}`
+
+- Realtime query: `select * from airlineStats_REALTIME where DaysSinceEpoch >= 16084 LIMIT {Int.MaxValue}`
+
+```scala
+val df = spark.read
+      .format("pinot")
+      .option("table", "airlineStats")
+      .option("tableType", "offline")
+      .load()
+      .filter($"DestStateName" === "Florida")
+      .filter($"Origin" === "ORD")
+      .select($"Carrier")

Review Comment:
   Good catch, fixed.





[GitHub] [pinot] cbalci commented on pull request #10394: Pinot Spark Connector for Spark3

Posted by "cbalci (via GitHub)" <gi...@apache.org>.
cbalci commented on PR #10394:
URL: https://github.com/apache/pinot/pull/10394#issuecomment-1463366238

   From @KKcorps:
   > Can you try to test it in cluster mode while running on a proper YARN, AWS EMR or DataProc cluster?
   
   Good suggestion. I ended up testing this in our YARN environment with success. I should note, though, that our YARN environment runs Spark 3.0.2, so I had to downgrade the connector's Spark version for the test. Everything else worked as expected.
   




[GitHub] [pinot] GSharayu commented on a diff in pull request #10394: Pinot Spark Connector for Spark3

Posted by "GSharayu (via GitHub)" <gi...@apache.org>.
GSharayu commented on code in PR #10394:
URL: https://github.com/apache/pinot/pull/10394#discussion_r1131258298


##########
pinot-connectors/pinot-spark-3-connector/pom.xml:
##########
@@ -0,0 +1,324 @@
+<?xml version="1.0"?>
+<!--
+
+    Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements.  See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership.  The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License.  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+    Unless required by applicable law or agreed to in writing,
+    software distributed under the License is distributed on an
+    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+    KIND, either express or implied.  See the License for the
+    specific language governing permissions and limitations
+    under the License.
+
+-->
+<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
+         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
+  <modelVersion>4.0.0</modelVersion>
+  <parent>
+    <artifactId>pinot-connectors</artifactId>
+    <groupId>org.apache.pinot</groupId>
+    <version>0.13.0-SNAPSHOT</version>
+    <relativePath>..</relativePath>
+  </parent>
+  <artifactId>pinot-spark-3-connector</artifactId>
+  <name>Pinot Spark 3 Connector</name>
+  <url>https://pinot.apache.org/</url>
+  <properties>
+    <pinot.root>${basedir}/../..</pinot.root>
+    <spark.version>3.2.1</spark.version>
+    <antlr-runtime.version>4.8</antlr-runtime.version>
+    <scalatest.version>3.1.1</scalatest.version>
+    <shadeBase>org.apache.pinot.\$internal</shadeBase>
+
+    <!-- TODO: delete this prop once all the checkstyle warnings are fixed -->
+    <checkstyle.fail.on.violation>false</checkstyle.fail.on.violation>
+  </properties>
+
+  <profiles>
+    <profile>
+      <id>scala-2.12</id>
+      <activation>
+        <activeByDefault>true</activeByDefault>
+      </activation>
+      <properties>
+        <scala.version>2.12.11</scala.version>
+        <scala.compat.version>2.12</scala.compat.version>
+      </properties>
+      <dependencies>
+        <dependency>
+          <groupId>org.apache.spark</groupId>
+          <artifactId>spark-sql_${scala.compat.version}</artifactId>
+          <version>${spark.version}</version>
+          <scope>provided</scope>
+          <exclusions>
+            <exclusion>
+              <groupId>org.antlr</groupId>
+              <artifactId>antlr4-runtime</artifactId>
+            </exclusion>
+            <exclusion>
+              <groupId>org.apache.curator</groupId>
+              <artifactId>curator-recipes</artifactId>
+            </exclusion>
+            <exclusion>
+              <groupId>com.thoughtworks.paranamer</groupId>
+              <artifactId>paranamer</artifactId>
+            </exclusion>
+            <exclusion>
+              <groupId>org.scala-lang.modules</groupId>
+              <artifactId>scala-xml_${scala.compat.version}</artifactId>
+            </exclusion>
+            <exclusion>
+              <groupId>org.scala-lang</groupId>
+              <artifactId>scala-library</artifactId>
+            </exclusion>
+            <exclusion>
+              <groupId>com.zaxxer</groupId>
+              <artifactId>HikariCP-java7</artifactId>
+            </exclusion>
+          </exclusions>
+        </dependency>
+
+        <dependency>
+          <groupId>org.scala-lang</groupId>
+          <artifactId>scala-library</artifactId>
+          <version>${scala.version}</version>
+          <scope>provided</scope>
+        </dependency>
+        <!-- tests -->
+        <dependency>
+          <groupId>org.scalatest</groupId>
+          <artifactId>scalatest_${scala.compat.version}</artifactId>
+          <version>${scalatest.version}</version>
+          <scope>test</scope>
+          <exclusions>
+            <exclusion>
+              <groupId>org.scala-lang.modules</groupId>
+              <artifactId>scala-xml_${scala.compat.version}</artifactId>
+            </exclusion>
+            <exclusion>
+              <groupId>org.scala-lang</groupId>
+              <artifactId>scala-library</artifactId>
+            </exclusion>
+            <exclusion>
+              <groupId>org.scala-lang</groupId>
+              <artifactId>scala-reflect</artifactId>
+            </exclusion>
+          </exclusions>
+        </dependency>
+      </dependencies>
+      <build>
+        <plugins>
+          <!-- scala build -->
+          <plugin>
+            <groupId>org.xolstice.maven.plugins</groupId>
+            <artifactId>protobuf-maven-plugin</artifactId>
+            <version>0.6.1</version>
+            <configuration>
+              <protocArtifact>com.google.protobuf:protoc:3.19.2:exe:${os.detected.classifier}</protocArtifact>
+              <pluginId>grpc-java</pluginId>
+              <pluginArtifact>io.grpc:protoc-gen-grpc-java:1.44.1:exe:${os.detected.classifier}</pluginArtifact>
+            </configuration>
+            <executions>
+              <execution>
+                <goals>
+                  <goal>compile</goal>
+                  <goal>compile-custom</goal>
+                </goals>
+              </execution>
+            </executions>
+          </plugin>
+          <plugin>
+            <artifactId>maven-shade-plugin</artifactId>
+            <executions>
+              <execution>
+                <phase>package</phase>
+                <goals>
+                  <goal>shade</goal>
+                </goals>
+                <configuration>
+                  <relocations>
+                    <relocation>
+                      <pattern>com</pattern>
+                      <shadedPattern>${shadeBase}.com</shadedPattern>
+                      <includes>
+                        <include>com.google.protobuf.**</include>
+                        <include>com.google.common.**</include>

Review Comment:
   +1





[GitHub] [pinot] GSharayu commented on a diff in pull request #10394: Pinot Spark Connector for Spark3

Posted by "GSharayu (via GitHub)" <gi...@apache.org>.
GSharayu commented on code in PR #10394:
URL: https://github.com/apache/pinot/pull/10394#discussion_r1131260556


##########
pinot-connectors/pinot-spark-3-connector/src/main/scala/org/apache/pinot/connector/spark/v3/datasource/PinotDataSource.scala:
##########
@@ -0,0 +1,51 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.connector.spark.v3.datasource
+
+import org.apache.pinot.connector.spark.common.{PinotClusterClient, PinotDataSourceReadOptions}
+import org.apache.spark.sql.connector.catalog.{Table, TableProvider}
+import org.apache.spark.sql.connector.expressions.Transform
+import org.apache.spark.sql.sources.DataSourceRegister
+import org.apache.spark.sql.types.StructType
+import org.apache.spark.sql.util.CaseInsensitiveStringMap
+
+import java.util
+
+class PinotDataSource extends TableProvider with DataSourceRegister {

Review Comment:
   Can we please have docs on the classes wherever important and possible?





[GitHub] [pinot] cbalci commented on a diff in pull request #10394: Pinot Spark Connector for Spark3

Posted by "cbalci (via GitHub)" <gi...@apache.org>.
cbalci commented on code in PR #10394:
URL: https://github.com/apache/pinot/pull/10394#discussion_r1131847893


##########
pinot-connectors/pinot-spark-3-connector/src/main/scala/org/apache/pinot/connector/spark/v3/datasource/PinotDataSource.scala:
##########
@@ -0,0 +1,51 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.connector.spark.v3.datasource
+
+import org.apache.pinot.connector.spark.common.{PinotClusterClient, PinotDataSourceReadOptions}
+import org.apache.spark.sql.connector.catalog.{Table, TableProvider}
+import org.apache.spark.sql.connector.expressions.Transform
+import org.apache.spark.sql.sources.DataSourceRegister
+import org.apache.spark.sql.types.StructType
+import org.apache.spark.sql.util.CaseInsensitiveStringMap
+
+import java.util
+
+class PinotDataSource extends TableProvider with DataSourceRegister {

Review Comment:
   Thanks for the suggestion. Added.
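   
   For illustration, the kind of class-level doc requested would look roughly like this (a sketch, not the text actually committed):
   
   ```scala
   import java.util
   
   import org.apache.spark.sql.connector.catalog.{Table, TableProvider}
   import org.apache.spark.sql.connector.expressions.Transform
   import org.apache.spark.sql.sources.DataSourceRegister
   import org.apache.spark.sql.types.StructType
   import org.apache.spark.sql.util.CaseInsensitiveStringMap
   
   /**
    * Spark 3 DataSourceV2 entry point for reading Pinot tables. Registered
    * under the short name "pinot", so it is resolved by spark.read.format("pinot").
    * When the user supplies no schema, it is fetched from the Pinot controller.
    */
   class PinotDataSource extends TableProvider with DataSourceRegister {
     // DataSourceRegister: the name that .format(...) resolves
     override def shortName(): String = "pinot"
     // TableProvider members are stubbed out in this sketch
     override def inferSchema(options: CaseInsensitiveStringMap): StructType = ???
     override def getTable(schema: StructType, partitioning: Array[Transform],
         properties: util.Map[String, String]): Table = ???
   }
   ```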





[GitHub] [pinot] cbalci commented on a diff in pull request #10394: Pinot Spark Connector for Spark3

Posted by "cbalci (via GitHub)" <gi...@apache.org>.
cbalci commented on code in PR #10394:
URL: https://github.com/apache/pinot/pull/10394#discussion_r1131851353


##########
pinot-connectors/pinot-spark-3-connector/README.md:
##########
@@ -0,0 +1,69 @@
+<!--
+
+    Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements.  See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership.  The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License.  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+    Unless required by applicable law or agreed to in writing,
+    software distributed under the License is distributed on an
+    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+    KIND, either express or implied.  See the License for the
+    specific language governing permissions and limitations
+    under the License.
+
+-->
+# Spark-Pinot Connector
+
+Spark-pinot connector to read and write data from/to Pinot.

Review Comment:
   You're right, this is for reads only. Fixed.





[GitHub] [pinot] GSharayu commented on a diff in pull request #10394: Pinot Spark Connector for Spark3

Posted by "GSharayu (via GitHub)" <gi...@apache.org>.
GSharayu commented on code in PR #10394:
URL: https://github.com/apache/pinot/pull/10394#discussion_r1131257095


##########
pinot-connectors/pinot-spark-3-connector/documentation/read_model.md:
##########
@@ -0,0 +1,140 @@
+<!--
+
+    Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements.  See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership.  The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License.  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+    Unless required by applicable law or agreed to in writing,
+    software distributed under the License is distributed on an
+    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+    KIND, either express or implied.  See the License for the
+    specific language governing permissions and limitations
+    under the License.
+
+-->
+# Read Model
+
+The connector can scan offline, realtime and hybrid tables. The two base parameters, `table` and `tableType`, have to be given as below:
+- For an offline table: `table: tbl`, `tableType: OFFLINE or offline`
+- For a realtime table: `table: tbl`, `tableType: REALTIME or realtime`
+- For a hybrid table: `table: tbl`, `tableType: HYBRID or hybrid`
+
+An example scan:
+
+```scala
+val df = spark.read
+      .format("pinot")
+      .option("table", "airlineStats")
+      .option("tableType", "offline")
+      .load()
+```
+
+A custom schema can be specified directly. If no schema is specified, the connector reads the table schema from the Pinot controller and converts it to a Spark schema.
+
+### Architecture
+
+The connector reads data directly from the Pinot servers. It first creates a query with the given columns and filters (if filter push-down is enabled), then finds the routing table for that query. Based on the routing table and `segmentsPerSplit` (explained in detail below), it creates Pinot splits that contain **one Pinot server and one or more segments per Spark partition**. Finally, each partition reads data from its Pinot server in parallel.
+
+![Spark-Pinot Connector Architecture](images/spark-pinot-connector-executor-server-interaction.jpg)
+
+Each Spark partition opens a connection to its Pinot server and reads the data. For example, assume the routing table for a query looks like this:
+
+```
+- realtime ->
+   - realtimeServer1 -> (segment1, segment2, segment3)
+   - realtimeServer2 -> (segment4)
+- offline ->
+   - offlineServer10 -> (segment10, segment20)
+```
+
+If `segmentsPerSplit` is 3, three Spark partitions will be created as below:
+
+| Spark Partition  | Queried Pinot Server/Segments |
+| ------------- | ------------- |
+| partition1  | realtimeServer1 / segment1, segment2, segment3  |
+| partition2  | realtimeServer2 / segment4  |
+| partition3  | offlineServer10 / segment10, segment20 |
+
+If `segmentsPerSplit` is 1, six Spark partitions will be created:
+
+| Spark Partition  | Queried Pinot Server/Segments |
+| ------------- | ------------- |
+| partition1  | realtimeServer1 / segment1 |
+| partition2  | realtimeServer1 / segment2  |
+| partition3  | realtimeServer1 / segment3 |
+| partition4  | realtimeServer2 / segment4 |
+| partition5  | offlineServer10 / segment10 |
+| partition6  | offlineServer10 / segment20 |
+
+A low `segmentsPerSplit` value means more parallelism, but it also means many connections will be opened to the Pinot servers, which increases the QPS on them.
+
+A high `segmentsPerSplit` value means less parallelism; each Pinot server will scan more segments per request.
+
+**Note:** Pinot servers prune segments based on segment metadata when a query arrives. In some cases (for example, when filtering on certain columns), some servers may not return data, so some Spark partitions will be empty. In such cases, `repartition()` may be applied for efficient data analysis after loading the data into Spark.
+
+### Filter and Column Push-Down
+The connector supports filter and column push-down: filters and columns are pushed to the Pinot servers, which improves read performance by minimizing data transfer between Pinot and Spark. Filter push-down is enabled by default. If filters should instead be applied in Spark, set `usePushDownFilters` to `false`.
+
+The connector uses SQL, so all SQL filters are supported.
+
+### Segment Pruning
+
+The connector fetches the routing table for the generated query to determine which Pinot servers will be queried and which segments will be scanned. If partitioning is enabled for the Pinot table and the query created in Spark scans only specific partitions, only the required Pinot server and segment information is fetched; that is, segments are pruned before reading data, just as Pinot brokers do. For more information, see [Optimizing Scatter and Gather in Pinot](https://docs.pinot.apache.org/operators/operating-pinot/tuning/routing#optimizing-scatter-and-gather).
+
+### Table Querying
+The connector uses SQL to query Pinot tables.
+
+The connector creates realtime and offline queries based on the filters and required columns:
+- If the queried table type is `OFFLINE` or `REALTIME`, routing table information is fetched for that table type.
+- If the queried table type is `HYBRID`, both realtime and offline routing table information is fetched. The connector also fetches the table's `TimeBoundary` and uses it in the queries to ensure that the overlap between realtime and offline segment data is queried exactly once. For more information, see [Pinot Broker](https://docs.pinot.apache.org/basics/components/broker).
+
+### Query Generation
+
+Example generated queries for the usages below (assume the `airlineStats` table is hybrid and the TimeBoundary is `DaysSinceEpoch, 16084`):
+
+```scala
+val df = spark.read
+      .format("pinot")
+      .option("table", "airlineStats")
+      .option("tableType", "hybrid")
+      .load()
+```
+
+For the above usage, the following realtime and offline SQL queries will be created:
+
+- Offline query: `select * from airlineStats_OFFLINE where DaysSinceEpoch < 16084 LIMIT {Int.MaxValue}`
+
+- Realtime query: `select * from airlineStats_REALTIME where DaysSinceEpoch >= 16084 LIMIT {Int.MaxValue}`
+
+```scala
+val df = spark.read
+      .format("pinot")
+      .option("table", "airlineStats")
+      .option("tableType", "offline")
+      .load()
+      .filter($"DestStateName" === "Florida")
+      .filter($"Origin" === "ORD")
+      .select($"Carrier")

Review Comment:
   Minor: shouldn't all 3 columns be in the select?
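   
   That is, presumably something like (a sketch of the suggested fix, not the committed change):
   
   ```scala
   val df = spark.read
         .format("pinot")
         .option("table", "airlineStats")
         .option("tableType", "offline")
         .load()
         .filter($"DestStateName" === "Florida")
         .filter($"Origin" === "ORD")
         // select all three referenced columns, not just Carrier
         .select($"Carrier", $"DestStateName", $"Origin")
   ```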





[GitHub] [pinot] GSharayu commented on a diff in pull request #10394: Pinot Spark Connector for Spark3

Posted by "GSharayu (via GitHub)" <gi...@apache.org>.
GSharayu commented on code in PR #10394:
URL: https://github.com/apache/pinot/pull/10394#discussion_r1131260556


##########
pinot-connectors/pinot-spark-3-connector/src/main/scala/org/apache/pinot/connector/spark/v3/datasource/PinotDataSource.scala:
##########
@@ -0,0 +1,51 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.connector.spark.v3.datasource
+
+import org.apache.pinot.connector.spark.common.{PinotClusterClient, PinotDataSourceReadOptions}
+import org.apache.spark.sql.connector.catalog.{Table, TableProvider}
+import org.apache.spark.sql.connector.expressions.Transform
+import org.apache.spark.sql.sources.DataSourceRegister
+import org.apache.spark.sql.types.StructType
+import org.apache.spark.sql.util.CaseInsensitiveStringMap
+
+import java.util
+
+class PinotDataSource extends TableProvider with DataSourceRegister {

Review Comment:
   Can we please have javadocs on the classes wherever important and possible? It helps with ease of understanding.





[GitHub] [pinot] KKcorps commented on pull request #10394: Pinot Spark Connector for Spark3

Posted by "KKcorps (via GitHub)" <gi...@apache.org>.
KKcorps commented on PR #10394:
URL: https://github.com/apache/pinot/pull/10394#issuecomment-1461770973

   Thanks for the contribution! Can you try to test it in cluster mode while running on a proper YARN, AWS EMR or DataProc cluster?
   
   Sometimes Spark jobs inside Pinot fail in these environments because of conflicting runtime libs.
   
   Solution is simply to keep shading until the problems get resolved, imo.




[GitHub] [pinot] cbalci commented on a diff in pull request #10394: Pinot Spark Connector for Spark3

Posted by "cbalci (via GitHub)" <gi...@apache.org>.
cbalci commented on code in PR #10394:
URL: https://github.com/apache/pinot/pull/10394#discussion_r1131798826


##########
pinot-connectors/pinot-spark-3-connector/documentation/read_model.md:
##########
@@ -0,0 +1,140 @@
+<!--

Review Comment:
   Disclosure: I copied this documentation from the Spark 2 based implementation since at this point the functionality is pretty much the same. However, they will likely diverge pretty soon, so I think they deserve separate doc pages.





[GitHub] [pinot] cbalci commented on a diff in pull request #10394: Pinot Spark Connector for Spark3

Posted by "cbalci (via GitHub)" <gi...@apache.org>.
cbalci commented on code in PR #10394:
URL: https://github.com/apache/pinot/pull/10394#discussion_r1131798826


##########
pinot-connectors/pinot-spark-3-connector/documentation/read_model.md:
##########
@@ -0,0 +1,140 @@
+<!--

Review Comment:
   Disclosure: I copied this documentation from the Spark 2 based implementation since at this point the functionality is pretty much the same, so it is possible that I overlooked some details, but I'm always happy to fix them.





[GitHub] [pinot] KKcorps commented on a diff in pull request #10394: Pinot Spark Connector for Spark3

Posted by "KKcorps (via GitHub)" <gi...@apache.org>.
KKcorps commented on code in PR #10394:
URL: https://github.com/apache/pinot/pull/10394#discussion_r1130794374


##########
pinot-connectors/pinot-spark-3-connector/pom.xml:
##########
@@ -0,0 +1,324 @@
+<?xml version="1.0"?>
+<!--
+
+    Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements.  See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership.  The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License.  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+    Unless required by applicable law or agreed to in writing,
+    software distributed under the License is distributed on an
+    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+    KIND, either express or implied.  See the License for the
+    specific language governing permissions and limitations
+    under the License.
+
+-->
+<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
+         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
+  <modelVersion>4.0.0</modelVersion>
+  <parent>
+    <artifactId>pinot-connectors</artifactId>
+    <groupId>org.apache.pinot</groupId>
+    <version>0.13.0-SNAPSHOT</version>
+    <relativePath>..</relativePath>
+  </parent>
+  <artifactId>pinot-spark-3-connector</artifactId>
+  <name>Pinot Spark 3 Connector</name>
+  <url>https://pinot.apache.org/</url>
+  <properties>
+    <pinot.root>${basedir}/../..</pinot.root>
+    <spark.version>3.2.1</spark.version>
+    <antlr-runtime.version>4.8</antlr-runtime.version>
+    <scalatest.version>3.1.1</scalatest.version>
+    <shadeBase>org.apache.pinot.\$internal</shadeBase>
+
+    <!-- TODO: delete this prop once all the checkstyle warnings are fixed -->
+    <checkstyle.fail.on.violation>false</checkstyle.fail.on.violation>
+  </properties>
+
+  <profiles>
+    <profile>
+      <id>scala-2.12</id>
+      <activation>
+        <activeByDefault>true</activeByDefault>
+      </activation>
+      <properties>
+        <scala.version>2.12.11</scala.version>
+        <scala.compat.version>2.12</scala.compat.version>
+      </properties>
+      <dependencies>
+        <dependency>
+          <groupId>org.apache.spark</groupId>
+          <artifactId>spark-sql_${scala.compat.version}</artifactId>
+          <version>${spark.version}</version>
+          <scope>provided</scope>
+          <exclusions>
+            <exclusion>
+              <groupId>org.antlr</groupId>
+              <artifactId>antlr4-runtime</artifactId>
+            </exclusion>
+            <exclusion>
+              <groupId>org.apache.curator</groupId>
+              <artifactId>curator-recipes</artifactId>
+            </exclusion>
+            <exclusion>
+              <groupId>com.thoughtworks.paranamer</groupId>
+              <artifactId>paranamer</artifactId>
+            </exclusion>
+            <exclusion>
+              <groupId>org.scala-lang.modules</groupId>
+              <artifactId>scala-xml_${scala.compat.version}</artifactId>
+            </exclusion>
+            <exclusion>
+              <groupId>org.scala-lang</groupId>
+              <artifactId>scala-library</artifactId>
+            </exclusion>
+            <exclusion>
+              <groupId>com.zaxxer</groupId>
+              <artifactId>HikariCP-java7</artifactId>
+            </exclusion>
+          </exclusions>
+        </dependency>
+
+        <dependency>
+          <groupId>org.scala-lang</groupId>
+          <artifactId>scala-library</artifactId>
+          <version>${scala.version}</version>
+          <scope>provided</scope>
+        </dependency>
+        <!-- tests -->
+        <dependency>
+          <groupId>org.scalatest</groupId>
+          <artifactId>scalatest_${scala.compat.version}</artifactId>
+          <version>${scalatest.version}</version>
+          <scope>test</scope>
+          <exclusions>
+            <exclusion>
+              <groupId>org.scala-lang.modules</groupId>
+              <artifactId>scala-xml_${scala.compat.version}</artifactId>
+            </exclusion>
+            <exclusion>
+              <groupId>org.scala-lang</groupId>
+              <artifactId>scala-library</artifactId>
+            </exclusion>
+            <exclusion>
+              <groupId>org.scala-lang</groupId>
+              <artifactId>scala-reflect</artifactId>
+            </exclusion>
+          </exclusions>
+        </dependency>
+      </dependencies>
+      <build>
+        <plugins>
+          <!-- scala build -->
+          <plugin>
+            <groupId>org.xolstice.maven.plugins</groupId>
+            <artifactId>protobuf-maven-plugin</artifactId>
+            <version>0.6.1</version>
+            <configuration>
+              <protocArtifact>com.google.protobuf:protoc:3.19.2:exe:${os.detected.classifier}</protocArtifact>
+              <pluginId>grpc-java</pluginId>
+              <pluginArtifact>io.grpc:protoc-gen-grpc-java:1.44.1:exe:${os.detected.classifier}</pluginArtifact>
+            </configuration>
+            <executions>
+              <execution>
+                <goals>
+                  <goal>compile</goal>
+                  <goal>compile-custom</goal>
+                </goals>
+              </execution>
+            </executions>
+          </plugin>
+          <plugin>
+            <artifactId>maven-shade-plugin</artifactId>
+            <executions>
+              <execution>
+                <phase>package</phase>
+                <goals>
+                  <goal>shade</goal>
+                </goals>
+                <configuration>
+                  <relocations>
+                    <relocation>
+                      <pattern>com</pattern>
+                      <shadedPattern>${shadeBase}.com</shadedPattern>
+                      <includes>
+                        <include>com.google.protobuf.**</include>
+                        <include>com.google.common.**</include>

Review Comment:
   Can we check for hadoop classes and shade them as well?





[GitHub] [pinot] GSharayu commented on a diff in pull request #10394: Pinot Spark Connector for Spark3

Posted by "GSharayu (via GitHub)" <gi...@apache.org>.
GSharayu commented on code in PR #10394:
URL: https://github.com/apache/pinot/pull/10394#discussion_r1131253107


##########
pinot-connectors/pinot-spark-3-connector/documentation/read_model.md:
##########
@@ -0,0 +1,140 @@
+<!--
+
+    Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements.  See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership.  The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License.  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+    Unless required by applicable law or agreed to in writing,
+    software distributed under the License is distributed on an
+    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+    KIND, either express or implied.  See the License for the
+    specific language governing permissions and limitations
+    under the License.
+
+-->
+# Read Model
+
+The connector can scan offline, realtime and hybrid tables. The two base parameters, `table` and `tableType`, have to be given as below:
+- For an offline table: `table: tbl`, `tableType: OFFLINE or offline`
+- For a realtime table: `table: tbl`, `tableType: REALTIME or realtime`
+- For a hybrid table: `table: tbl`, `tableType: HYBRID or hybrid`
+
+An example scan:
+
+```scala
+val df = spark.read
+      .format("pinot")
+      .option("table", "airlineStats")
+      .option("tableType", "offline")
+      .load()
+```
+
+A custom schema can be specified directly. If no schema is specified, the connector reads the table schema from the Pinot controller and converts it to a Spark schema.
+
+### Architecture
+
+The connector reads data directly from the Pinot servers. It first creates a query with the given columns and filters (if filter push-down is enabled), then finds the routing table for that query. Based on the routing table and `segmentsPerSplit` (explained in detail below), it creates Pinot splits that contain **one Pinot server and one or more segments per Spark partition**. Finally, each partition reads data from its Pinot server in parallel.
+
+![Spark-Pinot Connector Architecture](images/spark-pinot-connector-executor-server-interaction.jpg)
+
+Each Spark partition opens a connection to a Pinot server and reads data. For example, assume that the routing table for a given query looks like this:
+
+```
+- realtime ->
+   - realtimeServer1 -> (segment1, segment2, segment3)
+   - realtimeServer2 -> (segment4)
+- offline ->
+   - offlineServer10 -> (segment10, segment20)
+```
+
+If `segmentsPerSplit` is equal to 3, 3 Spark partitions will be created as below:
+
+| Spark Partition  | Queried Pinot Server/Segments |
+| ------------- | ------------- |
+| partition1  | realtimeServer1 / segment1, segment2, segment3  |
+| partition2  | realtimeServer2 / segment4  |
+| partition3  | offlineServer10 / segment10, segment20 |
+
+If `segmentsPerSplit` is equal to 1, 6 Spark partitions will be created:
+
+| Spark Partition  | Queried Pinot Server/Segments |
+| ------------- | ------------- |
+| partition1  | realtimeServer1 / segment1 |
+| partition2  | realtimeServer1 / segment2  |
+| partition3  | realtimeServer1 / segment3 |
+| partition4  | realtimeServer2 / segment4 |
+| partition5  | offlineServer10 / segment10 |
+| partition6  | offlineServer10 / segment20 |
+
+If the `segmentsPerSplit` value is too low, parallelism increases, but many more connections will be opened to the Pinot servers, increasing the QPS on them.
+
+If the `segmetnsPerSplit` value is too high, parallelism decreases, and each Pinot server will scan more segments per request.

Review Comment:
   (nit) typo segmentsPerSplit



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] cbalci commented on a diff in pull request #10394: Pinot Spark Connector for Spark3

Posted by "cbalci (via GitHub)" <gi...@apache.org>.
cbalci commented on code in PR #10394:
URL: https://github.com/apache/pinot/pull/10394#discussion_r1131800899


##########
pinot-connectors/pinot-spark-3-connector/README.md:
##########
@@ -0,0 +1,69 @@
+<!--

Review Comment:
   Disclosure: I copied this documentation from the Spark2-based implementation since at this point the functionality is pretty much the same. However, the two will likely diverge soon, so I think they deserve separate doc pages.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] GSharayu commented on pull request #10394: Pinot Spark Connector for Spark3

Posted by "GSharayu (via GitHub)" <gi...@apache.org>.
GSharayu commented on PR #10394:
URL: https://github.com/apache/pinot/pull/10394#issuecomment-1466671456

   LGTM as well. Thank you for the changes


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] codecov-commenter commented on pull request #10394: Pinot Spark Connector for Spark3

Posted by "codecov-commenter (via GitHub)" <gi...@apache.org>.
codecov-commenter commented on PR #10394:
URL: https://github.com/apache/pinot/pull/10394#issuecomment-1459163421

   # [Codecov](https://codecov.io/gh/apache/pinot/pull/10394?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#10394](https://codecov.io/gh/apache/pinot/pull/10394?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (17fce62) into [master](https://codecov.io/gh/apache/pinot/commit/5d0089a6faea2abee60f462f20be80f042af76c8?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (5d0089a) will **decrease** coverage by `0.01%`.
   > The diff coverage is `0.00%`.
   
   ```diff
   @@             Coverage Diff              @@
   ##             master   #10394      +/-   ##
   ============================================
   - Coverage     67.87%   67.87%   -0.01%     
   - Complexity     5742     5743       +1     
   ============================================
     Files          1521     1521              
     Lines         80305    80314       +9     
     Branches      12826    12829       +3     
   ============================================
   + Hits          54506    54512       +6     
   - Misses        21957    21964       +7     
   + Partials       3842     3838       -4     
   ```
   
   | Flag | Coverage Δ | |
   |---|---|---|
   | unittests1 | `67.87% <0.00%> (-0.01%)` | :arrow_down: |
   
   Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#carryforward-flags-in-the-pull-request-comment) to find out more.
   
   | [Impacted Files](https://codecov.io/gh/apache/pinot/pull/10394?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [.../pinot/segment/local/upsert/ComparisonColumns.java](https://codecov.io/gh/apache/pinot/pull/10394?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91cHNlcnQvQ29tcGFyaXNvbkNvbHVtbnMuamF2YQ==) | `93.75% <0.00%> (-6.25%)` | :arrow_down: |
   | [...ore/query/scheduler/resources/ResourceManager.java](https://codecov.io/gh/apache/pinot/pull/10394?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3QtY29yZS9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3QvY29yZS9xdWVyeS9zY2hlZHVsZXIvcmVzb3VyY2VzL1Jlc291cmNlTWFuYWdlci5qYXZh) | `88.46% <0.00%> (-11.54%)` | :arrow_down: |
   | [.../pinot/core/query/scheduler/PriorityScheduler.java](https://codecov.io/gh/apache/pinot/pull/10394?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3QtY29yZS9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3QvY29yZS9xdWVyeS9zY2hlZHVsZXIvUHJpb3JpdHlTY2hlZHVsZXIuamF2YQ==) | `77.77% <0.00%> (-5.56%)` | :arrow_down: |
   | [...e/pinot/segment/local/io/util/PinotDataBitSet.java](https://codecov.io/gh/apache/pinot/pull/10394?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC9pby91dGlsL1Bpbm90RGF0YUJpdFNldC5qYXZh) | `95.62% <0.00%> (-1.46%)` | :arrow_down: |
   | [...inot/core/operator/filter/FilterOperatorUtils.java](https://codecov.io/gh/apache/pinot/pull/10394?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3QtY29yZS9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3QvY29yZS9vcGVyYXRvci9maWx0ZXIvRmlsdGVyT3BlcmF0b3JVdGlscy5qYXZh) | `85.55% <0.00%> (-1.03%)` | :arrow_down: |
   | [...core/query/reduce/ExplainPlanDataTableReducer.java](https://codecov.io/gh/apache/pinot/pull/10394?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3QtY29yZS9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3QvY29yZS9xdWVyeS9yZWR1Y2UvRXhwbGFpblBsYW5EYXRhVGFibGVSZWR1Y2VyLmphdmE=) | `82.31% <0.00%> (-0.69%)` | :arrow_down: |
   | [.../aggregation/function/ModeAggregationFunction.java](https://codecov.io/gh/apache/pinot/pull/10394?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3QtY29yZS9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3QvY29yZS9xdWVyeS9hZ2dyZWdhdGlvbi9mdW5jdGlvbi9Nb2RlQWdncmVnYXRpb25GdW5jdGlvbi5qYXZh) | `88.64% <0.00%> (+0.81%)` | :arrow_up: |
   | [...gregation/function/StUnionAggregationFunction.java](https://codecov.io/gh/apache/pinot/pull/10394?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3QtY29yZS9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3QvY29yZS9xdWVyeS9hZ2dyZWdhdGlvbi9mdW5jdGlvbi9TdFVuaW9uQWdncmVnYXRpb25GdW5jdGlvbi5qYXZh) | `76.47% <0.00%> (+2.94%)` | :arrow_up: |
   | [...he/pinot/segment/local/segment/store/IndexKey.java](https://codecov.io/gh/apache/pinot/pull/10394?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC9zZWdtZW50L3N0b3JlL0luZGV4S2V5LmphdmE=) | `75.00% <0.00%> (+5.00%)` | :arrow_up: |
   | [...ot/query/runtime/executor/RoundRobinScheduler.java](https://codecov.io/gh/apache/pinot/pull/10394?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3QtcXVlcnktcnVudGltZS9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3QvcXVlcnkvcnVudGltZS9leGVjdXRvci9Sb3VuZFJvYmluU2NoZWR1bGVyLmphdmE=) | `92.85% <0.00%> (+5.95%)` | :arrow_up: |
   
   :mega: We’re building smart automated test selection to slash your CI/CD build times. [Learn more](https://about.codecov.io/iterative-testing/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] xiangfu0 merged pull request #10394: Pinot Spark Connector for Spark3

Posted by "xiangfu0 (via GitHub)" <gi...@apache.org>.
xiangfu0 merged PR #10394:
URL: https://github.com/apache/pinot/pull/10394


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] GSharayu commented on a diff in pull request #10394: Pinot Spark Connector for Spark3

Posted by "GSharayu (via GitHub)" <gi...@apache.org>.
GSharayu commented on code in PR #10394:
URL: https://github.com/apache/pinot/pull/10394#discussion_r1131257095


##########
pinot-connectors/pinot-spark-3-connector/documentation/read_model.md:
##########
@@ -0,0 +1,140 @@
+<!--
+
+    Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements.  See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership.  The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License.  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+    Unless required by applicable law or agreed to in writing,
+    software distributed under the License is distributed on an
+    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+    KIND, either express or implied.  See the License for the
+    specific language governing permissions and limitations
+    under the License.
+
+-->
+# Read Model
+
+The connector can scan offline, realtime and hybrid tables. The two base options, `table` and `tableType`, must be given as below:
+- For an offline table: `table: tbl`, `tableType: OFFLINE or offline`
+- For a realtime table: `table: tbl`, `tableType: REALTIME or realtime`
+- For a hybrid table: `table: tbl`, `tableType: HYBRID or hybrid`
+
+An example scan:
+
+```scala
+val df = spark.read
+      .format("pinot")
+      .option("table", "airlineStats")
+      .option("tableType", "offline")
+      .load()
+```
+
+A custom schema can be specified directly. If no schema is specified, the connector reads the table schema from the Pinot controller and converts it to a Spark schema.
+
+### Architecture
+
+The connector reads data directly from the `Pinot Servers`. To do so, it first creates a query from the given columns and filters (if filter push down is enabled), then fetches the routing table for that query. Based on the routing table and `segmentsPerSplit` (explained in detail below), it creates Pinot splits that contain **ONE PINOT SERVER and ONE OR MORE SEGMENTS per Spark partition**. Finally, each partition reads data from its assigned Pinot server in parallel.
+
+![Spark-Pinot Connector Architecture](images/spark-pinot-connector-executor-server-interaction.jpg)
+
+Each Spark partition opens a connection to a Pinot server and reads data. For example, assume that the routing table for a given query looks like this:
+
+```
+- realtime ->
+   - realtimeServer1 -> (segment1, segment2, segment3)
+   - realtimeServer2 -> (segment4)
+- offline ->
+   - offlineServer10 -> (segment10, segment20)
+```
+
+If `segmentsPerSplit` is equal to 3, 3 Spark partitions will be created as below:
+
+| Spark Partition  | Queried Pinot Server/Segments |
+| ------------- | ------------- |
+| partition1  | realtimeServer1 / segment1, segment2, segment3  |
+| partition2  | realtimeServer2 / segment4  |
+| partition3  | offlineServer10 / segment10, segment20 |
+
+If `segmentsPerSplit` is equal to 1, 6 Spark partitions will be created:
+
+| Spark Partition  | Queried Pinot Server/Segments |
+| ------------- | ------------- |
+| partition1  | realtimeServer1 / segment1 |
+| partition2  | realtimeServer1 / segment2  |
+| partition3  | realtimeServer1 / segment3 |
+| partition4  | realtimeServer2 / segment4 |
+| partition5  | offlineServer10 / segment10 |
+| partition6  | offlineServer10 / segment20 |
+
+If the `segmentsPerSplit` value is too low, parallelism increases, but many more connections will be opened to the Pinot servers, increasing the QPS on them.
+
+If the `segmetnsPerSplit` value is too high, parallelism decreases, and each Pinot server will scan more segments per request.
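+
+As a minimal sketch, the option might be tuned like below (the value 1 is illustrative; the option name is as documented above):
+
+```scala
+val df = spark.read
+      .format("pinot")
+      .option("table", "airlineStats")
+      .option("tableType", "offline")
+      // smaller splits -> more partitions and parallelism, but more server connections
+      .option("segmentsPerSplit", "1")
+      .load()
+```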
+
+**Note:** Pinot servers prune segments based on segment metadata when a query arrives. In some cases (for example, when filtering on certain columns), some servers may not return any data, which leaves the corresponding Spark partitions empty. In such cases, `repartition()` may be applied for more efficient analysis after the data is loaded into Spark.
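+
+A minimal sketch of such a repartition (the partition count is illustrative):
+
+```scala
+import spark.implicits._ // for the $"col" syntax
+
+val df = spark.read
+      .format("pinot")
+      .option("table", "airlineStats")
+      .option("tableType", "offline")
+      .load()
+      .filter($"Origin" === "ORD") // pruning may leave some partitions empty
+      .repartition(8)              // rebalance before heavy downstream work
+```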
+
+### Filter And Column Push Down
+The connector supports filter and column push down. Filters and columns are pushed down to the Pinot servers, which improves read performance by minimizing the data transfer between Pinot and Spark. Filter push down is enabled by default. If filters should instead be applied in Spark, `usePushDownFilters` should be set to `false`.
+
+The connector uses SQL, so all SQL filters are supported.
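+
+As a minimal sketch, push down can be disabled like below, in which case the filter is evaluated by Spark instead of the Pinot servers:
+
+```scala
+val df = spark.read
+      .format("pinot")
+      .option("table", "airlineStats")
+      .option("tableType", "offline")
+      .option("usePushDownFilters", "false")
+      .load()
+      .filter($"Origin" === "ORD") // evaluated by Spark, not pushed to Pinot
+```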
+
+### Segment Pruning
+
+The connector fetches the routing table for a given query to learn which Pinot servers will be queried and which segments will be scanned. If partitioning is enabled for the Pinot table and the query created in Spark scans only specific partitions, only the required Pinot server and segment information is fetched (that is, segment pruning is applied before reading data, just as Pinot brokers do). For more information, see [Optimizing Scatter and Gather in Pinot](https://docs.pinot.apache.org/operators/operating-pinot/tuning/routing#optimizing-scatter-and-gather)
+
+### Table Querying
+The connector uses SQL to query Pinot tables.
+
+The connector creates realtime and offline queries based on the filters and required columns.
+- If the queried table type is `OFFLINE` or `REALTIME`, routing table information is fetched for that table type.
+- If the queried table type is `HYBRID`, both realtime and offline routing table information is fetched. The connector also receives the `TimeBoundary` information for the table and uses it in the queries to ensure that the overlap between realtime and offline segment data is queried exactly once. For more information, see [Pinot Broker](https://docs.pinot.apache.org/basics/components/broker)
+
+### Query Generation
+
+Example generated queries for the usages below (assume that the `airlineStats` table is hybrid and the TimeBoundary information is `DaysSinceEpoch, 16084`):
+
+```scala
+val df = spark.read
+      .format("pinot")
+      .option("table", "airlineStats")
+      .option("tableType", "hybrid")
+      .load()
+```
+
+For the usage above, the following realtime and offline SQL queries will be created:
+
+- Offline query: `select * from airlineStats_OFFLINE where DaysSinceEpoch < 16084 LIMIT {Int.MaxValue}`
+
+- Realtime query: `select * from airlineStats_REALTIME where DaysSinceEpoch >= 16084 LIMIT {Int.MaxValue}`
+
+```scala
+val df = spark.read
+      .format("pinot")
+      .option("table", "airlineStats")
+      .option("tableType", "offline")
+      .load()
+      .filter($"DestStateName" === "Florida")
+      .filter($"Origin" === "ORD")
+      .select($"Carrier")

Review Comment:
   (minor) shouldn't all 3 columns be selected here?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] cbalci commented on a diff in pull request #10394: Pinot Spark Connector for Spark3

Posted by "cbalci (via GitHub)" <gi...@apache.org>.
cbalci commented on code in PR #10394:
URL: https://github.com/apache/pinot/pull/10394#discussion_r1131798826


##########
pinot-connectors/pinot-spark-3-connector/documentation/read_model.md:
##########
@@ -0,0 +1,140 @@
+<!--

Review Comment:
   Full disclosure: I copied this documentation from the Spark2-based implementation since at this point the functionality is pretty much the same. So it is possible that I overlooked some details, but I'm always happy to fix them.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] cbalci commented on pull request #10394: Pinot Spark Connector for Spark3

Posted by "cbalci (via GitHub)" <gi...@apache.org>.
cbalci commented on PR #10394:
URL: https://github.com/apache/pinot/pull/10394#issuecomment-1468873475

   Thanks for the reviews folks!
   
   > One suggestion would be to add the bash command that you use on a YARN cluster to trigger this job in cluster mode.
   > 
   > Sometimes running the same command as locally but adding `deploy-mode cluster` doesn't work properly.
   
   Good idea. Added a sample `spark-submit` command to the documentation for running the included examples in cluster mode.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] cbalci commented on pull request #10394: Pinot Spark Connector for Spark3

Posted by "cbalci (via GitHub)" <gi...@apache.org>.
cbalci commented on PR #10394:
URL: https://github.com/apache/pinot/pull/10394#issuecomment-1463367064

   Thanks for the reviews @KKcorps @GSharayu @jackjlli ! I tried to address/resolve your comments, ptal.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] GSharayu commented on a diff in pull request #10394: Pinot Spark Connector for Spark3

Posted by "GSharayu (via GitHub)" <gi...@apache.org>.
GSharayu commented on code in PR #10394:
URL: https://github.com/apache/pinot/pull/10394#discussion_r1131260556


##########
pinot-connectors/pinot-spark-3-connector/src/main/scala/org/apache/pinot/connector/spark/v3/datasource/PinotDataSource.scala:
##########
@@ -0,0 +1,51 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.connector.spark.v3.datasource
+
+import org.apache.pinot.connector.spark.common.{PinotClusterClient, PinotDataSourceReadOptions}
+import org.apache.spark.sql.connector.catalog.{Table, TableProvider}
+import org.apache.spark.sql.connector.expressions.Transform
+import org.apache.spark.sql.sources.DataSourceRegister
+import org.apache.spark.sql.types.StructType
+import org.apache.spark.sql.util.CaseInsensitiveStringMap
+
+import java.util
+
+class PinotDataSource extends TableProvider with DataSourceRegister {

Review Comment:
   Can we please have javadocs on classes wherever important and possible?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] cbalci commented on a diff in pull request #10394: Pinot Spark Connector for Spark3

Posted by "cbalci (via GitHub)" <gi...@apache.org>.
cbalci commented on code in PR #10394:
URL: https://github.com/apache/pinot/pull/10394#discussion_r1131777730


##########
pinot-connectors/pinot-spark-3-connector/pom.xml:
##########
@@ -0,0 +1,324 @@
+<?xml version="1.0"?>
+<!--
+
+    Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements.  See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership.  The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License.  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+    Unless required by applicable law or agreed to in writing,
+    software distributed under the License is distributed on an
+    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+    KIND, either express or implied.  See the License for the
+    specific language governing permissions and limitations
+    under the License.
+
+-->
+<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
+         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
+  <modelVersion>4.0.0</modelVersion>
+  <parent>
+    <artifactId>pinot-connectors</artifactId>
+    <groupId>org.apache.pinot</groupId>
+    <version>0.13.0-SNAPSHOT</version>
+    <relativePath>..</relativePath>
+  </parent>
+  <artifactId>pinot-spark-3-connector</artifactId>
+  <name>Pinot Spark 3 Connector</name>
+  <url>https://pinot.apache.org/</url>
+  <properties>
+    <pinot.root>${basedir}/../..</pinot.root>
+    <spark.version>3.2.1</spark.version>
+    <antlr-runtime.version>4.8</antlr-runtime.version>
+    <scalatest.version>3.1.1</scalatest.version>
+    <shadeBase>org.apache.pinot.\$internal</shadeBase>
+
+    <!-- TODO: delete this prop once all the checkstyle warnings are fixed -->
+    <checkstyle.fail.on.violation>false</checkstyle.fail.on.violation>
+  </properties>
+
+  <profiles>
+    <profile>
+      <id>scala-2.12</id>
+      <activation>
+        <activeByDefault>true</activeByDefault>
+      </activation>
+      <properties>
+        <scala.version>2.12.11</scala.version>
+        <scala.compat.version>2.12</scala.compat.version>
+      </properties>
+      <dependencies>
+        <dependency>
+          <groupId>org.apache.spark</groupId>
+          <artifactId>spark-sql_${scala.compat.version}</artifactId>
+          <version>${spark.version}</version>
+          <scope>provided</scope>
+          <exclusions>
+            <exclusion>
+              <groupId>org.antlr</groupId>
+              <artifactId>antlr4-runtime</artifactId>
+            </exclusion>
+            <exclusion>
+              <groupId>org.apache.curator</groupId>
+              <artifactId>curator-recipes</artifactId>
+            </exclusion>
+            <exclusion>
+              <groupId>com.thoughtworks.paranamer</groupId>
+              <artifactId>paranamer</artifactId>
+            </exclusion>
+            <exclusion>
+              <groupId>org.scala-lang.modules</groupId>
+              <artifactId>scala-xml_${scala.compat.version}</artifactId>
+            </exclusion>
+            <exclusion>
+              <groupId>org.scala-lang</groupId>
+              <artifactId>scala-library</artifactId>
+            </exclusion>
+            <exclusion>
+              <groupId>com.zaxxer</groupId>
+              <artifactId>HikariCP-java7</artifactId>
+            </exclusion>
+          </exclusions>
+        </dependency>
+
+        <dependency>
+          <groupId>org.scala-lang</groupId>
+          <artifactId>scala-library</artifactId>
+          <version>${scala.version}</version>
+          <scope>provided</scope>
+        </dependency>
+        <!-- tests -->
+        <dependency>
+          <groupId>org.scalatest</groupId>
+          <artifactId>scalatest_${scala.compat.version}</artifactId>
+          <version>${scalatest.version}</version>
+          <scope>test</scope>
+          <exclusions>
+            <exclusion>
+              <groupId>org.scala-lang.modules</groupId>
+              <artifactId>scala-xml_${scala.compat.version}</artifactId>
+            </exclusion>
+            <exclusion>
+              <groupId>org.scala-lang</groupId>
+              <artifactId>scala-library</artifactId>
+            </exclusion>
+            <exclusion>
+              <groupId>org.scala-lang</groupId>
+              <artifactId>scala-reflect</artifactId>
+            </exclusion>
+          </exclusions>
+        </dependency>
+      </dependencies>
+      <build>
+        <plugins>
+          <!-- scala build -->
+          <plugin>
+            <groupId>org.xolstice.maven.plugins</groupId>
+            <artifactId>protobuf-maven-plugin</artifactId>
+            <version>0.6.1</version>
+            <configuration>
+              <protocArtifact>com.google.protobuf:protoc:3.19.2:exe:${os.detected.classifier}</protocArtifact>
+              <pluginId>grpc-java</pluginId>
+              <pluginArtifact>io.grpc:protoc-gen-grpc-java:1.44.1:exe:${os.detected.classifier}</pluginArtifact>
+            </configuration>
+            <executions>
+              <execution>
+                <goals>
+                  <goal>compile</goal>
+                  <goal>compile-custom</goal>
+                </goals>
+              </execution>
+            </executions>
+          </plugin>
+          <plugin>
+            <artifactId>maven-shade-plugin</artifactId>
+            <executions>
+              <execution>
+                <phase>package</phase>
+                <goals>
+                  <goal>shade</goal>
+                </goals>
+                <configuration>
+                  <relocations>
+                    <relocation>
+                      <pattern>com</pattern>
+                      <shadedPattern>${shadeBase}.com</shadedPattern>
+                      <includes>
+                        <include>com.google.protobuf.**</include>
+                        <include>com.google.common.**</include>

Review Comment:
   Thanks for the suggestion. I checked the [dependency tree](https://gist.github.com/cbalci/f7b09e2df40f4c6731653584d9e51026) and it looks like the hadoop dependencies are only included in 'provided' scope, so I don't think we need to shade them here.
   Let me know if you think otherwise.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org