Posted to commits@druid.apache.org by "a2l007 (via GitHub)" <gi...@apache.org> on 2023/05/22 22:10:08 UTC

[GitHub] [druid] a2l007 opened a new pull request, #14329: Extension to read and ingest iceberg data files

a2l007 opened a new pull request, #14329:
URL: https://github.com/apache/druid/pull/14329

   Fixes #13923.
   
   ### Description
   
   This adds a new contrib extension, `druid-iceberg-extensions`, which can be used to ingest data stored in the [Iceberg](https://iceberg.apache.org/docs/latest/) table format. It adds a new input source of type `iceberg` that connects to a catalog, retrieves the data files associated with an Iceberg table, and provides those data file paths to either an S3 or HDFS input source, depending on the warehouse location.
   
   Two important dependencies associated with Iceberg tables are:
   - Catalog: this extension supports reading from either a Hive Metastore catalog or a local file-based catalog. Support for AWS Glue is not available yet.
   - Warehouse: this extension supports reading data files from either HDFS or S3. Adapters for other cloud object stores should be easy to add by extending `AbstractInputSourceAdapter`.
   
   Sample ingestion spec:
   
   ```json
   "inputSource": {
     "type": "iceberg",
     "tableName": "logs",
     "namespace": "webapp",
     "icebergFilter": {
       "type": "interval",
       "filterColumn": "createTime",
       "intervals": [
         "2023-05-10T00:00:00.000Z/2023-05-15T00:00:00.000Z"
       ]
     },
     "icebergCatalog": {
       "type": "hive",
       "warehousePath": "hdfs://localwarehouse/",
       "catalogUri": "thrift://hdfscatalog:9083",
       "catalogProperties": {
         "hive.metastore.connect.retries": "1",
         "hive.metastore.execute.setugi": "false",
         "hive.metastore.kerberos.principal": "principal@krb.com",
         "hive.metastore.sasl.enabled": "true",
         "hadoop.security.authentication": "kerberos",
         "hadoop.security.authorization": "true",
         "java.security.auth.login.config": "jaas.config"
       }
     },
     "warehouseSource": {
       "type": "hdfs"
     }
   },
   "inputFormat": {
     "type": "parquet"
   }
   ```
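   
   For reference, the other supported catalog type is a local file-based catalog, which only needs a warehouse path. The following is a minimal sketch of that catalog object; the path is a placeholder and not taken from this PR:
   
   ```json
   "icebergCatalog": {
     "type": "local",
     "warehousePath": "file:///tmp/iceberg/warehouse"
   }
   ```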
   
   #### Release note
   Added support to ingest Iceberg data files into Druid.
   
   <hr>
   
   ##### Key changed/added classes in this PR
    * `IcebergCatalog.java`
    * `IcebergInputSource.java`
   
   <hr>
   
   
   This PR has:
   
   - [x] been self-reviewed.
   - [x] added documentation for new or modified features or behaviors.
   - [x] a release note entry in the PR description.
   - [x] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
   - [ ] added or updated version, license, or notice information in [licenses.yaml](https://github.com/apache/druid/blob/master/dev/license.md)
   - [x] added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
   - [x] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for [code coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md) is met.
   - [x] been tested in a test Druid cluster.
   




Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260527029


##########
docs/ingestion/input-sources.md:
##########
@@ -794,6 +794,194 @@ The following is an example of a Combining input source spec:
 ...
 ```
 
+## Iceberg input source
+
+> You need to include the `druid-iceberg-extensions` as an extension to use the Iceberg input source.
+
+The Iceberg input source is used to read data stored in the Iceberg table format. For a given table, this input source scans up to the latest iceberg snapshot from the configured Hive catalog and the underlying live data files will be ingested using the existing input source formats available in Druid.
+
+The Iceberg input source cannot be independent as it relies on the existing input sources to perform the actual read from the Data files.
+For example, if the warehouse associated with an iceberg catalog is on `S3`, please ensure that the [`druid-s3-extensions`](../development/extensions-core/s3.md) extension is also loaded.
+
+Sample specs:
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "iceberg",
+        "tableName": "iceberg_table",
+        "namespace": "iceberg_namespace",
+        "icebergCatalog": {
+            "type": "hive",
+            "warehousePath": "hdfs://warehouse/path",
+            "catalogUri": "thrift://hive-metastore.x.com:8970",
+            "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "catalog_test",
+                "hadoop.security.authentication": "kerberos",
+                "hadoop.security.authorization": "true"
+            }
+        },
+        "icebergFilter": {
+            "type": "interval",
+            "filterColumn": "event_time",
+            "intervals": [
+              "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+            ]
+        },
+        "warehouseSource": {
+            "type": "hdfs"
+        }
+      },
+      "inputFormat": {
+        "type": "parquet"
+      }
+  },
+      ...
+},
+...
+```
+
+```json
+...
+        "ioConfig": {
+          "type": "index_parallel",
+          "inputSource": {
+            "type": "iceberg",
+            "tableName": "iceberg_table",
+            "namespace": "iceberg_namespace",
+            "icebergCatalog": {
+              "type": "hive",
+              "warehousePath": "hdfs://warehouse/path",
+              "catalogUri": "thrift://hive-metastore.x.com:8970",
+              "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "default_catalog",
+                "fs.s3a.access.key" : "S3_ACCESS_KEY",
+                "fs.s3a.secret.key" : "S3_SECRET_KEY",
+                "fs.s3a.endpoint" : "S3_API_ENDPOINT"
+              }
+            },
+            "icebergFilter": {
+              "type": "interval",
+              "filterColumn": "event_time",
+              "intervals": [
+                "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+              ]
+            },
+            "warehouseSource": {
+              "type": "s3",
+              "endpointConfig": {
+                "url": "teststore.aws.com",
+                "signingRegion": "us-west-2a"
+              },
+              "clientConfig": {
+                "protocol": "http",
+                "disableChunkedEncoding": true,
+                "enablePathStyleAccess": true,
+                "forceGlobalBucketAccessEnabled": false
+              },
+              "properties": {
+                "accessKeyId": {
+                  "type": "default",
+                  "password": "foo"
+                },
+                "secretAccessKey": {
+                  "type": "default",
+                  "password": "bar"
+                }
+              },
+            }
+          },
+          "inputFormat": {
+            "type": "parquet"
+          }
+        },
+...
+},
+```
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set the value to `iceberg`.|yes|
+|tableName|The iceberg table name configured in the catalog.|yes|
+|namespace|The iceberg namespace associated with the table|yes|
+|icebergFilter|JSON Object used to filter data files within a snapshot when reading|no|
+|icebergCatalog|JSON Object used to define the catalog that manages the configured iceberg table|yes|
+|warehouseSource|JSON Object used to indicate which native input source needs to be used to read the data files from the warehouse|yes|
+
+Catalog Object:
+
+There are two supported catalog types: `local` and `hive`
+
+Local catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `local`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+Hive Catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `hive`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogUri|The URI associated with the hive catalog|yes|

Review Comment:
   ```suggestion
   |`catalogUri`|The URI associated with the `hive` catalog.|Yes|
   ```





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260534188


##########
docs/ingestion/input-sources.md:
##########
@@ -794,6 +794,194 @@ The following is an example of a Combining input source spec:
 ...
 ```
 
+## Iceberg input source
+
+> You need to include the `druid-iceberg-extensions` as an extension to use the Iceberg input source.
+
+The Iceberg input source is used to read data stored in the Iceberg table format. For a given table, this input source scans up to the latest iceberg snapshot from the configured Hive catalog and the underlying live data files will be ingested using the existing input source formats available in Druid.
+
+The Iceberg input source cannot be independent as it relies on the existing input sources to perform the actual read from the Data files.
+For example, if the warehouse associated with an iceberg catalog is on `S3`, please ensure that the [`druid-s3-extensions`](../development/extensions-core/s3.md) extension is also loaded.
+
+Sample specs:
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "iceberg",
+        "tableName": "iceberg_table",
+        "namespace": "iceberg_namespace",
+        "icebergCatalog": {
+            "type": "hive",
+            "warehousePath": "hdfs://warehouse/path",
+            "catalogUri": "thrift://hive-metastore.x.com:8970",
+            "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "catalog_test",
+                "hadoop.security.authentication": "kerberos",
+                "hadoop.security.authorization": "true"
+            }
+        },
+        "icebergFilter": {
+            "type": "interval",
+            "filterColumn": "event_time",
+            "intervals": [
+              "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+            ]
+        },
+        "warehouseSource": {
+            "type": "hdfs"
+        }
+      },
+      "inputFormat": {
+        "type": "parquet"
+      }
+  },
+      ...
+},
+...
+```
+
+```json
+...
+        "ioConfig": {
+          "type": "index_parallel",
+          "inputSource": {
+            "type": "iceberg",
+            "tableName": "iceberg_table",
+            "namespace": "iceberg_namespace",
+            "icebergCatalog": {
+              "type": "hive",
+              "warehousePath": "hdfs://warehouse/path",
+              "catalogUri": "thrift://hive-metastore.x.com:8970",
+              "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "default_catalog",
+                "fs.s3a.access.key" : "S3_ACCESS_KEY",
+                "fs.s3a.secret.key" : "S3_SECRET_KEY",
+                "fs.s3a.endpoint" : "S3_API_ENDPOINT"
+              }
+            },
+            "icebergFilter": {
+              "type": "interval",
+              "filterColumn": "event_time",
+              "intervals": [
+                "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+              ]
+            },
+            "warehouseSource": {
+              "type": "s3",
+              "endpointConfig": {
+                "url": "teststore.aws.com",
+                "signingRegion": "us-west-2a"
+              },
+              "clientConfig": {
+                "protocol": "http",
+                "disableChunkedEncoding": true,
+                "enablePathStyleAccess": true,
+                "forceGlobalBucketAccessEnabled": false
+              },
+              "properties": {
+                "accessKeyId": {
+                  "type": "default",
+                  "password": "foo"
+                },
+                "secretAccessKey": {
+                  "type": "default",
+                  "password": "bar"
+                }
+              },
+            }
+          },
+          "inputFormat": {
+            "type": "parquet"
+          }
+        },
+...
+},
+```
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set the value to `iceberg`.|yes|
+|tableName|The iceberg table name configured in the catalog.|yes|
+|namespace|The iceberg namespace associated with the table|yes|
+|icebergFilter|JSON Object used to filter data files within a snapshot when reading|no|
+|icebergCatalog|JSON Object used to define the catalog that manages the configured iceberg table|yes|
+|warehouseSource|JSON Object used to indicate which native input source needs to be used to read the data files from the warehouse|yes|
+
+Catalog Object:
+
+There are two supported catalog types: `local` and `hive`
+
+Local catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `local`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+Hive Catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `hive`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogUri|The URI associated with the hive catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+IcebergFilter Object:
+
+This input source provides filters: `and` , `equals` , `interval` and `or`. These filters can be used to filter out data files from a snapshot, thereby reducing the number of files Druid has to ingest.
+
+`equals` Filter:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `equals`.|yes|
+|filterColumn|The column name from the iceberg table schema based on which filtering needs to happen|yes|
+|filterValue|The value to filter on|yes|
+
+`interval` Filter:

Review Comment:
   ```suggestion
   The `interval` filter:
   ```





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260514651


##########
docs/ingestion/input-sources.md:
##########
@@ -794,6 +794,194 @@ The following is an example of a Combining input source spec:
 ...
 ```
 
+## Iceberg input source
+
+> You need to include the `druid-iceberg-extensions` as an extension to use the Iceberg input source.
+
+The Iceberg input source is used to read data stored in the Iceberg table format. For a given table, this input source scans up to the latest iceberg snapshot from the configured Hive catalog and the underlying live data files will be ingested using the existing input source formats available in Druid.
+
+The Iceberg input source cannot be independent as it relies on the existing input sources to perform the actual read from the Data files.
+For example, if the warehouse associated with an iceberg catalog is on `S3`, please ensure that the [`druid-s3-extensions`](../development/extensions-core/s3.md) extension is also loaded.
+
+Sample specs:

Review Comment:
   ```suggestion
   The following is a sample spec for an HDFS warehouse source:
   ```





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260342485


##########
docs/development/extensions-contrib/iceberg.md:
##########
@@ -0,0 +1,121 @@
+---
+id: iceberg 
+title: "Iceberg"
+---
+
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+
+## Iceberg Ingest Extension

Review Comment:
   I don't think this heading is necessary. If you delete this heading, you can move the other headings up a level.
   
   Since the topic is about Iceberg ingestion, consider introducing the feature first and then talk about the extension as a means of enabling the feature. For example:
   
   Apache Iceberg is an open table format for huge analytic datasets. 
   [Iceberg input source](../../ingestion/input-sources.md#iceberg-input-source) lets you ingest data stored in the Iceberg table format into Apache Druid. To enable the Iceberg input source, add `druid-iceberg-extensions` to the list of extensions. See [Loading extensions](../../configuration/extensions.md#loading-extensions) for more information.
   
   Iceberg manages most of its metadata in metadata files in the object storage. In some cases, it uses a metastore to manage a certain amount of metadata.





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1261364035


##########
docs/development/extensions-contrib/iceberg.md:
##########
@@ -0,0 +1,121 @@
+---
+id: iceberg 
+title: "Iceberg"

Review Comment:
   ```suggestion
   title: "Iceberg extension"
   ```





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260355964


##########
docs/development/extensions-contrib/iceberg.md:
##########
@@ -0,0 +1,121 @@
+---
+id: iceberg 
+title: "Iceberg"
+---
+
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+
+## Iceberg Ingest Extension
+
+This extension provides [IcebergInputSource](../../ingestion/input-sources.md#iceberg-input-source) which enables ingestion of data stored in the Iceberg table format into Druid.
+
+Apache Iceberg is an open table format for huge analytic datasets. Even though iceberg manages most of its metadata in metadata files in the object storage, it is still dependent on a metastore for managing a certain amount of metadata.
+These metastores are defined as Iceberg catalogs and this extension supports connecting to the following catalog types:
+* Hive metastore catalog
+* Local catalog
+
+Support for AWS Glue and REST based catalogs are not available yet.
+
+For a given catalog, iceberg table name and filters, the IcebergInputSource works by reading the table from the catalog, applying the filters and extracting all the underlying live data files up to the latest snapshot.
+The data files are in either Parquet, ORC or Avro formats and all of these have InputFormat support in Druid. The data files typically reside in a warehouse location which could be in HDFS, S3 or the local filesystem.
+This extension relies on the existing InputSource connectors in Druid to read the data files from the warehouse. Therefore, the IcebergInputSource can be considered as an intermediate InputSource which provides the file paths for other InputSource implementations.
+
+### Load the Iceberg Ingest extension
+
+To use the iceberg extension, add the `druid-iceberg-extensions` to the list of loaded extensions. See [Loading extensions](../../configuration/extensions.md#loading-extensions) for more information.
+
+
+### Hive Metastore catalog
+
+For Druid to seamlessly talk to the Hive Metastore, ensure that the Hive specific configuration files such as `hive-site.xml` and `core-site.xml` are available in the Druid classpath for peon processes.  

Review Comment:
   ```suggestion
   For Druid to seamlessly talk to the Hive Metastore, ensure that the Hive configuration files such as `hive-site.xml` and `core-site.xml` are available in the Druid classpath for peon processes.  
   ```





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260500919


##########
docs/ingestion/input-sources.md:
##########
@@ -794,6 +794,194 @@ The following is an example of a Combining input source spec:
 ...
 ```
 
+## Iceberg input source
+
+> You need to include the `druid-iceberg-extensions` as an extension to use the Iceberg input source.
+
+The Iceberg input source is used to read data stored in the Iceberg table format. For a given table, this input source scans up to the latest iceberg snapshot from the configured Hive catalog and the underlying live data files will be ingested using the existing input source formats available in Druid.

Review Comment:
   ```suggestion
   You use the Iceberg input source to read data stored in the Iceberg table format. For a given table, the input source scans up to the latest Iceberg snapshot from the configured Hive catalog. Druid ingests the underlying live data files using the existing input source formats.
   ```





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260501397


##########
docs/ingestion/input-sources.md:
##########
@@ -794,6 +794,194 @@ The following is an example of a Combining input source spec:
 ...
 ```
 
+## Iceberg input source
+
+> You need to include the `druid-iceberg-extensions` as an extension to use the Iceberg input source.
+
+The Iceberg input source is used to read data stored in the Iceberg table format. For a given table, this input source scans up to the latest iceberg snapshot from the configured Hive catalog and the underlying live data files will be ingested using the existing input source formats available in Druid.
+
+The Iceberg input source cannot be independent as it relies on the existing input sources to perform the actual read from the Data files.

Review Comment:
   ```suggestion
   The Iceberg input source cannot be independent as it relies on the existing input sources to read from the Data files.
   ```





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "a2l007 (via GitHub)" <gi...@apache.org>.
a2l007 commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1263117658


##########
docs/ingestion/input-sources.md:
##########
@@ -794,6 +794,194 @@ The following is an example of a Combining input source spec:
 ...
 ```
 
+## Iceberg input source
+
+> You need to include the `druid-iceberg-extensions` as an extension to use the Iceberg input source.
+
+The Iceberg input source is used to read data stored in the Iceberg table format. For a given table, this input source scans up to the latest iceberg snapshot from the configured Hive catalog and the underlying live data files will be ingested using the existing input source formats available in Druid.
+
+The Iceberg input source cannot be independent as it relies on the existing input sources to perform the actual read from the Data files.
+For example, if the warehouse associated with an iceberg catalog is on `S3`, please ensure that the [`druid-s3-extensions`](../development/extensions-core/s3.md) extension is also loaded.
+
+Sample specs:
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "iceberg",
+        "tableName": "iceberg_table",
+        "namespace": "iceberg_namespace",
+        "icebergCatalog": {
+            "type": "hive",
+            "warehousePath": "hdfs://warehouse/path",
+            "catalogUri": "thrift://hive-metastore.x.com:8970",
+            "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "catalog_test",
+                "hadoop.security.authentication": "kerberos",
+                "hadoop.security.authorization": "true"
+            }
+        },
+        "icebergFilter": {
+            "type": "interval",
+            "filterColumn": "event_time",
+            "intervals": [
+              "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+            ]
+        },
+        "warehouseSource": {
+            "type": "hdfs"
+        }
+      },
+      "inputFormat": {
+        "type": "parquet"
+      }
+  },
+      ...
+},
+...
+```
+
+```json
+...
+        "ioConfig": {
+          "type": "index_parallel",
+          "inputSource": {
+            "type": "iceberg",
+            "tableName": "iceberg_table",
+            "namespace": "iceberg_namespace",
+            "icebergCatalog": {
+              "type": "hive",
+              "warehousePath": "hdfs://warehouse/path",
+              "catalogUri": "thrift://hive-metastore.x.com:8970",
+              "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "default_catalog",
+                "fs.s3a.access.key" : "S3_ACCESS_KEY",
+                "fs.s3a.secret.key" : "S3_SECRET_KEY",
+                "fs.s3a.endpoint" : "S3_API_ENDPOINT"
+              }
+            },
+            "icebergFilter": {
+              "type": "interval",
+              "filterColumn": "event_time",
+              "intervals": [
+                "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+              ]
+            },
+            "warehouseSource": {
+              "type": "s3",
+              "endpointConfig": {
+                "url": "teststore.aws.com",
+                "signingRegion": "us-west-2a"
+              },
+              "clientConfig": {
+                "protocol": "http",
+                "disableChunkedEncoding": true,
+                "enablePathStyleAccess": true,
+                "forceGlobalBucketAccessEnabled": false
+              },
+              "properties": {
+                "accessKeyId": {
+                  "type": "default",
+                  "password": "foo"
+                },
+                "secretAccessKey": {
+                  "type": "default",
+                  "password": "bar"
+                }
+              },
+            }
+          },
+          "inputFormat": {
+            "type": "parquet"
+          }
+        },
+...
+},
+```
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set the value to `iceberg`.|yes|
+|tableName|The iceberg table name configured in the catalog.|yes|
+|namespace|The iceberg namespace associated with the table|yes|
+|icebergFilter|JSON Object used to filter data files within a snapshot when reading|no|
+|icebergCatalog|JSON Object used to define the catalog that manages the configured iceberg table|yes|
+|warehouseSource|JSON Object used to indicate which native input source needs to be used to read the data files from the warehouse|yes|
+
+Catalog Object:
+
+There are two supported catalog types: `local` and `hive`
+
+Local catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `local`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+Hive Catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `hive`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogUri|The URI associated with the hive catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+IcebergFilter Object:
+
+This input source provides filters: `and` , `equals` , `interval` and `or`. These filters can be used to filter out data files from a snapshot, thereby reducing the number of files Druid has to ingest.
+
+`equals` Filter:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `equals`.|yes|
+|filterColumn|The column name from the iceberg table schema based on which filtering needs to happen|yes|
+|filterValue|The value to filter on|yes|
+
+`interval` Filter:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `interval`.|yes|
+|filterColumn|The column name from the iceberg table schema based on which filtering needs to happen|yes|
+|intervals|A JSON array containing ISO-8601 interval strings. This defines the time ranges to filter on. The start interval is inclusive and the end interval is exclusive. |yes|
+
+`and` Filter:

Review Comment:
   The `not` filter accepts a single filter, whereas `and` and `or` accept a list of Iceberg filters.
   The `filters` property accepts any of the other Iceberg filters mentioned in this section.
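   
   As an illustration, a composite filter that combines the documented `interval` and `equals` filters under an `and` might look like the following sketch (the column names and values here are only illustrative):
   
   ```json
   "icebergFilter": {
     "type": "and",
     "filters": [
       {
         "type": "interval",
         "filterColumn": "event_time",
         "intervals": [
           "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
         ]
       },
       {
         "type": "equals",
         "filterColumn": "region",
         "filterValue": "us-west-2"
       }
     ]
   }
   ```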





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260324391


##########
docs/development/extensions-contrib/iceberg.md:
##########
@@ -0,0 +1,121 @@
+---
+id: iceberg 
+title: "Iceberg"
+---
+
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+
+## Iceberg Ingest Extension
+
+This extension provides [IcebergInputSource](../../ingestion/input-sources.md#iceberg-input-source) which enables ingestion of data stored in the Iceberg table format into Druid.
+
+Apache Iceberg is an open table format for huge analytic datasets. Even though iceberg manages most of its metadata in metadata files in the object storage, it is still dependent on a metastore for managing a certain amount of metadata.

Review Comment:
   ```suggestion
   Apache Iceberg is an open table format for huge analytic datasets. Iceberg manages most of its metadata in metadata files in the object storage. In some cases, it uses a metastore to manage a certain amount of metadata.
   ```
   See comment on line 25.





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260330433


##########
docs/development/extensions-contrib/iceberg.md:
##########
@@ -0,0 +1,121 @@
+---
+id: iceberg 
+title: "Iceberg"
+---
+
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+
+## Iceberg Ingest Extension
+
+This extension provides [IcebergInputSource](../../ingestion/input-sources.md#iceberg-input-source) which enables ingestion of data stored in the Iceberg table format into Druid.

Review Comment:
   See comment on line 25.





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "a2l007 (via GitHub)" <gi...@apache.org>.
a2l007 commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1259003600


##########
extensions-contrib/druid-iceberg-extensions/pom.xml:
##########
@@ -0,0 +1,313 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+<project xmlns="http://maven.apache.org/POM/4.0.0"
+         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
+         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
+
+  <groupId>org.apache.druid.extensions</groupId>
+  <artifactId>druid-iceberg-extensions</artifactId>
+  <name>druid-iceberg-extensions</name>
+  <description>druid-iceberg-extensions</description>
+
+  <parent>
+    <artifactId>druid</artifactId>
+    <groupId>org.apache.druid</groupId>
+    <version>27.0.0-SNAPSHOT</version>
+    <relativePath>../../pom.xml</relativePath>
+  </parent>
+  <modelVersion>4.0.0</modelVersion>
+
+  <properties>
+  </properties>
+  <dependencies>
+    <dependency>
+      <groupId>org.apache.hadoop</groupId>
+      <artifactId>hadoop-common</artifactId>
+      <version>${hadoop.compile.version}</version>
+      <exclusions>
+        <exclusion>
+          <groupId>io.netty</groupId>
+          <artifactId>netty-buffer</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>commons-cli</groupId>
+          <artifactId>commons-cli</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>log4j</groupId>
+          <artifactId>log4j</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>commons-codec</groupId>
+          <artifactId>commons-codec</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>commons-logging</groupId>
+          <artifactId>commons-logging</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>commons-io</groupId>
+          <artifactId>commons-io</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>commons-lang</groupId>
+          <artifactId>commons-lang</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>org.apache.httpcomponents</groupId>
+          <artifactId>httpclient</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>org.apache.httpcomponents</groupId>
+          <artifactId>httpcore</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>org.apache.zookeeper</groupId>
+          <artifactId>zookeeper</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>org.slf4j</groupId>
+          <artifactId>slf4j-api</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>org.slf4j</groupId>
+          <artifactId>slf4j-log4j12</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>javax.ws.rs</groupId>
+          <artifactId>jsr311-api</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>com.google.code.findbugs</groupId>
+          <artifactId>jsr305</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>org.mortbay.jetty</groupId>
+          <artifactId>jetty-util</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>com.google.protobuf</groupId>
+          <artifactId>protobuf-java</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>com.sun.jersey</groupId>
+          <artifactId>jersey-core</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>org.apache.curator</groupId>
+          <artifactId>curator-client</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>org.apache.commons</groupId>
+          <artifactId>commons-math3</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>com.google.guava</groupId>
+          <artifactId>guava</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>org.apache.avro</groupId>
+          <artifactId>avro</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>net.java.dev.jets3t</groupId>
+          <artifactId>jets3t</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>com.sun.jersey</groupId>
+          <artifactId>jersey-json</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>com.jcraft</groupId>
+          <artifactId>jsch</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>org.mortbay.jetty</groupId>
+          <artifactId>jetty</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>com.sun.jersey</groupId>
+          <artifactId>jersey-server</artifactId>
+        </exclusion>
+        <!-- Following are excluded to remove security vulnerabilities: -->
+        <exclusion>
+          <groupId>commons-beanutils</groupId>
+          <artifactId>commons-beanutils-core</artifactId>
+        </exclusion>
+      </exclusions>
+    </dependency>
+
+    <dependency>
+      <groupId>org.apache.iceberg</groupId>
+      <artifactId>iceberg-spark-runtime-3.3_2.12</artifactId>
+      <version>1.0.0</version>
+      <exclusions>
+        <exclusion>
+          <groupId>com.fasterxml.jackson.core</groupId>
+          <artifactId>jackson-databind</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>com.fasterxml.jackson.core</groupId>
+          <artifactId>jackson-core</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>com.fasterxml.jackson.core</groupId>
+          <artifactId>jackson-annotations</artifactId>
+        </exclusion>
+      </exclusions>
+    </dependency>
+
+    <dependency>
+      <groupId>org.apache.hive</groupId>
+      <artifactId>hive-metastore</artifactId>
+      <version>3.1.3</version>
+      <exclusions>
+        <exclusion>
+          <groupId>com.fasterxml.jackson.core</groupId>
+          <artifactId>jackson-databind</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>com.fasterxml.jackson.core</groupId>
+          <artifactId>jackson-core</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>com.fasterxml.jackson.core</groupId>
+          <artifactId>jackson-annotations</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>org.apache.hadoop</groupId>
+          <artifactId>hadoop-hdfs</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>org.apache.hbase</groupId>
+          <artifactId>hbase-client</artifactId>
+        </exclusion>
+      </exclusions>
+    </dependency>
+
+    <dependency>
+      <groupId>org.apache.druid</groupId>
+      <artifactId>druid-processing</artifactId>
+      <version>${project.parent.version}</version>
+      <scope>provided</scope>
+      <exclusions>
+        <exclusion>
+          <groupId>org.slf4j</groupId>
+          <artifactId>slf4j-api</artifactId>
+        </exclusion>
+      </exclusions>
+    </dependency>
+    <dependency>
+      <groupId>com.fasterxml.jackson.core</groupId>
+      <artifactId>jackson-annotations</artifactId>
+      <scope>provided</scope>
+    </dependency>
+    <dependency>
+      <groupId>com.fasterxml.jackson.core</groupId>
+      <artifactId>jackson-databind</artifactId>
+      <scope>provided</scope>
+    </dependency>
+
+    <dependency>
+      <groupId>org.apache.hadoop</groupId>
+      <artifactId>hadoop-hdfs-client</artifactId>
+      <scope>runtime</scope>
+    </dependency>
+
+    <dependency>
+      <groupId>org.apache.hadoop</groupId>
+      <artifactId>hadoop-mapreduce-client-core</artifactId>

Review Comment:
   hadoop-mapreduce-client-core isn't a shaded jar. This is pulled in here for the `org/apache/hadoop/mapred` classes that the Hive catalog requires at runtime. This is a relatively lean jar, but if we end up moving the heavier hadoop-client dependencies to core lib, we could point to the hadoop-client-api dependency instead of this one. Until then it would be fine to keep it as it is.





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260531282


##########
docs/ingestion/input-sources.md:
##########
@@ -794,6 +794,194 @@ The following is an example of a Combining input source spec:
 ...
 ```
 
+## Iceberg input source
+
+> You need to include the `druid-iceberg-extensions` as an extension to use the Iceberg input source.
+
+The Iceberg input source is used to read data stored in the Iceberg table format. For a given table, this input source scans up to the latest iceberg snapshot from the configured Hive catalog and the underlying live data files will be ingested using the existing input source formats available in Druid.
+
+The Iceberg input source cannot be independent as it relies on the existing input sources to perform the actual read from the Data files.
+For example, if the warehouse associated with an iceberg catalog is on `S3`, please ensure that the [`druid-s3-extensions`](../development/extensions-core/s3.md) extension is also loaded.
+
+Sample specs:
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "iceberg",
+        "tableName": "iceberg_table",
+        "namespace": "iceberg_namespace",
+        "icebergCatalog": {
+            "type": "hive",
+            "warehousePath": "hdfs://warehouse/path",
+            "catalogUri": "thrift://hive-metastore.x.com:8970",
+            "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "catalog_test",
+                "hadoop.security.authentication": "kerberos",
+                "hadoop.security.authorization": "true"
+            }
+        },
+        "icebergFilter": {
+            "type": "interval",
+            "filterColumn": "event_time",
+            "intervals": [
+              "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+            ]
+        },
+        "warehouseSource": {
+            "type": "hdfs"
+        }
+      },
+      "inputFormat": {
+        "type": "parquet"
+      }
+  },
+      ...
+},
+...
+```
+
+```json
+...
+        "ioConfig": {
+          "type": "index_parallel",
+          "inputSource": {
+            "type": "iceberg",
+            "tableName": "iceberg_table",
+            "namespace": "iceberg_namespace",
+            "icebergCatalog": {
+              "type": "hive",
+              "warehousePath": "hdfs://warehouse/path",
+              "catalogUri": "thrift://hive-metastore.x.com:8970",
+              "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "default_catalog",
+                "fs.s3a.access.key" : "S3_ACCESS_KEY",
+                "fs.s3a.secret.key" : "S3_SECRET_KEY",
+                "fs.s3a.endpoint" : "S3_API_ENDPOINT"
+              }
+            },
+            "icebergFilter": {
+              "type": "interval",
+              "filterColumn": "event_time",
+              "intervals": [
+                "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+              ]
+            },
+            "warehouseSource": {
+              "type": "s3",
+              "endpointConfig": {
+                "url": "teststore.aws.com",
+                "signingRegion": "us-west-2a"
+              },
+              "clientConfig": {
+                "protocol": "http",
+                "disableChunkedEncoding": true,
+                "enablePathStyleAccess": true,
+                "forceGlobalBucketAccessEnabled": false
+              },
+              "properties": {
+                "accessKeyId": {
+                  "type": "default",
+                  "password": "foo"
+                },
+                "secretAccessKey": {
+                  "type": "default",
+                  "password": "bar"
+                }
+              },
+            }
+          },
+          "inputFormat": {
+            "type": "parquet"
+          }
+        },
+...
+},
+```
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set the value to `iceberg`.|yes|
+|tableName|The iceberg table name configured in the catalog.|yes|
+|namespace|The iceberg namespace associated with the table|yes|
+|icebergFilter|JSON Object used to filter data files within a snapshot when reading|no|
+|icebergCatalog|JSON Object used to define the catalog that manages the configured iceberg table|yes|
+|warehouseSource|JSON Object used to indicate which native input source needs to be used to read the data files from the warehouse|yes|
+
+Catalog Object:
+
+There are two supported catalog types: `local` and `hive`
+
+Local catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `local`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+Hive Catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `hive`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogUri|The URI associated with the hive catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+IcebergFilter Object:
+
+This input source provides filters: `and` , `equals` , `interval` and `or`. These filters can be used to filter out data files from a snapshot, thereby reducing the number of files Druid has to ingest.
+
+`equals` Filter:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `equals`.|yes|
+|filterColumn|The column name from the iceberg table schema based on which filtering needs to happen|yes|
+|filterValue|The value to filter on|yes|

Review Comment:
   ```suggestion
   |`filterValue`|The value to filter on.|Yes|
   ```





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260529007


##########
docs/ingestion/input-sources.md:
##########
@@ -794,6 +794,194 @@ The following is an example of a Combining input source spec:
 ...
 ```
 
+## Iceberg input source
+
+> You need to include the `druid-iceberg-extensions` as an extension to use the Iceberg input source.
+
+The Iceberg input source is used to read data stored in the Iceberg table format. For a given table, this input source scans up to the latest iceberg snapshot from the configured Hive catalog and the underlying live data files will be ingested using the existing input source formats available in Druid.
+
+The Iceberg input source cannot be independent as it relies on the existing input sources to perform the actual read from the Data files.
+For example, if the warehouse associated with an iceberg catalog is on `S3`, please ensure that the [`druid-s3-extensions`](../development/extensions-core/s3.md) extension is also loaded.
+
+Sample specs:
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "iceberg",
+        "tableName": "iceberg_table",
+        "namespace": "iceberg_namespace",
+        "icebergCatalog": {
+            "type": "hive",
+            "warehousePath": "hdfs://warehouse/path",
+            "catalogUri": "thrift://hive-metastore.x.com:8970",
+            "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "catalog_test",
+                "hadoop.security.authentication": "kerberos",
+                "hadoop.security.authorization": "true"
+            }
+        },
+        "icebergFilter": {
+            "type": "interval",
+            "filterColumn": "event_time",
+            "intervals": [
+              "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+            ]
+        },
+        "warehouseSource": {
+            "type": "hdfs"
+        }
+      },
+      "inputFormat": {
+        "type": "parquet"
+      }
+  },
+      ...
+},
+...
+```
+
+```json
+...
+        "ioConfig": {
+          "type": "index_parallel",
+          "inputSource": {
+            "type": "iceberg",
+            "tableName": "iceberg_table",
+            "namespace": "iceberg_namespace",
+            "icebergCatalog": {
+              "type": "hive",
+              "warehousePath": "hdfs://warehouse/path",
+              "catalogUri": "thrift://hive-metastore.x.com:8970",
+              "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "default_catalog",
+                "fs.s3a.access.key" : "S3_ACCESS_KEY",
+                "fs.s3a.secret.key" : "S3_SECRET_KEY",
+                "fs.s3a.endpoint" : "S3_API_ENDPOINT"
+              }
+            },
+            "icebergFilter": {
+              "type": "interval",
+              "filterColumn": "event_time",
+              "intervals": [
+                "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+              ]
+            },
+            "warehouseSource": {
+              "type": "s3",
+              "endpointConfig": {
+                "url": "teststore.aws.com",
+                "signingRegion": "us-west-2a"
+              },
+              "clientConfig": {
+                "protocol": "http",
+                "disableChunkedEncoding": true,
+                "enablePathStyleAccess": true,
+                "forceGlobalBucketAccessEnabled": false
+              },
+              "properties": {
+                "accessKeyId": {
+                  "type": "default",
+                  "password": "foo"
+                },
+                "secretAccessKey": {
+                  "type": "default",
+                  "password": "bar"
+                }
+              },
+            }
+          },
+          "inputFormat": {
+            "type": "parquet"
+          }
+        },
+...
+},
+```
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set the value to `iceberg`.|yes|
+|tableName|The iceberg table name configured in the catalog.|yes|
+|namespace|The iceberg namespace associated with the table|yes|
+|icebergFilter|JSON Object used to filter data files within a snapshot when reading|no|
+|icebergCatalog|JSON Object used to define the catalog that manages the configured iceberg table|yes|
+|warehouseSource|JSON Object used to indicate which native input source needs to be used to read the data files from the warehouse|yes|
+
+Catalog Object:
+
+There are two supported catalog types: `local` and `hive`
+
+Local catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `local`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+Hive Catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `hive`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogUri|The URI associated with the hive catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+IcebergFilter Object:
+
+This input source provides filters: `and` , `equals` , `interval` and `or`. These filters can be used to filter out data files from a snapshot, thereby reducing the number of files Druid has to ingest.
+
+`equals` Filter:

Review Comment:
   ```suggestion
   The `equals` filter:
   ```





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260527678


##########
docs/ingestion/input-sources.md:
##########
@@ -794,6 +794,194 @@ The following is an example of a Combining input source spec:
 ...
 ```
 
+## Iceberg input source
+
+> You need to include the `druid-iceberg-extensions` as an extension to use the Iceberg input source.
+
+The Iceberg input source is used to read data stored in the Iceberg table format. For a given table, this input source scans up to the latest iceberg snapshot from the configured Hive catalog and the underlying live data files will be ingested using the existing input source formats available in Druid.
+
+The Iceberg input source cannot be independent as it relies on the existing input sources to perform the actual read from the Data files.
+For example, if the warehouse associated with an iceberg catalog is on `S3`, please ensure that the [`druid-s3-extensions`](../development/extensions-core/s3.md) extension is also loaded.
+
+Sample specs:
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "iceberg",
+        "tableName": "iceberg_table",
+        "namespace": "iceberg_namespace",
+        "icebergCatalog": {
+            "type": "hive",
+            "warehousePath": "hdfs://warehouse/path",
+            "catalogUri": "thrift://hive-metastore.x.com:8970",
+            "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "catalog_test",
+                "hadoop.security.authentication": "kerberos",
+                "hadoop.security.authorization": "true"
+            }
+        },
+        "icebergFilter": {
+            "type": "interval",
+            "filterColumn": "event_time",
+            "intervals": [
+              "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+            ]
+        },
+        "warehouseSource": {
+            "type": "hdfs"
+        }
+      },
+      "inputFormat": {
+        "type": "parquet"
+      }
+  },
+      ...
+},
+...
+```
+
+```json
+...
+        "ioConfig": {
+          "type": "index_parallel",
+          "inputSource": {
+            "type": "iceberg",
+            "tableName": "iceberg_table",
+            "namespace": "iceberg_namespace",
+            "icebergCatalog": {
+              "type": "hive",
+              "warehousePath": "hdfs://warehouse/path",
+              "catalogUri": "thrift://hive-metastore.x.com:8970",
+              "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "default_catalog",
+                "fs.s3a.access.key" : "S3_ACCESS_KEY",
+                "fs.s3a.secret.key" : "S3_SECRET_KEY",
+                "fs.s3a.endpoint" : "S3_API_ENDPOINT"
+              }
+            },
+            "icebergFilter": {
+              "type": "interval",
+              "filterColumn": "event_time",
+              "intervals": [
+                "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+              ]
+            },
+            "warehouseSource": {
+              "type": "s3",
+              "endpointConfig": {
+                "url": "teststore.aws.com",
+                "signingRegion": "us-west-2a"
+              },
+              "clientConfig": {
+                "protocol": "http",
+                "disableChunkedEncoding": true,
+                "enablePathStyleAccess": true,
+                "forceGlobalBucketAccessEnabled": false
+              },
+              "properties": {
+                "accessKeyId": {
+                  "type": "default",
+                  "password": "foo"
+                },
+                "secretAccessKey": {
+                  "type": "default",
+                  "password": "bar"
+                }
+              },
+            }
+          },
+          "inputFormat": {
+            "type": "parquet"
+          }
+        },
+...
+},
+```
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set the value to `iceberg`.|yes|
+|tableName|The iceberg table name configured in the catalog.|yes|
+|namespace|The iceberg namespace associated with the table|yes|
+|icebergFilter|JSON Object used to filter data files within a snapshot when reading|no|
+|icebergCatalog|JSON Object used to define the catalog that manages the configured iceberg table|yes|
+|warehouseSource|JSON Object used to indicate which native input source needs to be used to read the data files from the warehouse|yes|
+
+Catalog Object:
+
+There are two supported catalog types: `local` and `hive`
+
+Local catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `local`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+Hive Catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `hive`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogUri|The URI associated with the hive catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+IcebergFilter Object:

Review Comment:
   ```suggestion
   ### Iceberg filter object
   ```





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260542481


##########
docs/ingestion/input-sources.md:
##########
@@ -794,6 +794,194 @@ The following is an example of a Combining input source spec:
 ...
 ```
 
+## Iceberg input source
+
+> You need to include the `druid-iceberg-extensions` as an extension to use the Iceberg input source.
+
+The Iceberg input source is used to read data stored in the Iceberg table format. For a given table, this input source scans up to the latest iceberg snapshot from the configured Hive catalog and the underlying live data files will be ingested using the existing input source formats available in Druid.
+
+The Iceberg input source cannot be independent as it relies on the existing input sources to perform the actual read from the Data files.
+For example, if the warehouse associated with an iceberg catalog is on `S3`, please ensure that the [`druid-s3-extensions`](../development/extensions-core/s3.md) extension is also loaded.
+
+Sample specs:
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "iceberg",
+        "tableName": "iceberg_table",
+        "namespace": "iceberg_namespace",
+        "icebergCatalog": {
+            "type": "hive",
+            "warehousePath": "hdfs://warehouse/path",
+            "catalogUri": "thrift://hive-metastore.x.com:8970",
+            "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "catalog_test",
+                "hadoop.security.authentication": "kerberos",
+                "hadoop.security.authorization": "true"
+            }
+        },
+        "icebergFilter": {
+            "type": "interval",
+            "filterColumn": "event_time",
+            "intervals": [
+              "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+            ]
+        },
+        "warehouseSource": {
+            "type": "hdfs"
+        }
+      },
+      "inputFormat": {
+        "type": "parquet"
+      }
+  },
+      ...
+},
+...
+```
+
+```json
+...
+        "ioConfig": {
+          "type": "index_parallel",
+          "inputSource": {
+            "type": "iceberg",
+            "tableName": "iceberg_table",
+            "namespace": "iceberg_namespace",
+            "icebergCatalog": {
+              "type": "hive",
+              "warehousePath": "hdfs://warehouse/path",
+              "catalogUri": "thrift://hive-metastore.x.com:8970",
+              "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "default_catalog",
+                "fs.s3a.access.key" : "S3_ACCESS_KEY",
+                "fs.s3a.secret.key" : "S3_SECRET_KEY",
+                "fs.s3a.endpoint" : "S3_API_ENDPOINT"
+              }
+            },
+            "icebergFilter": {
+              "type": "interval",
+              "filterColumn": "event_time",
+              "intervals": [
+                "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+              ]
+            },
+            "warehouseSource": {
+              "type": "s3",
+              "endpointConfig": {
+                "url": "teststore.aws.com",
+                "signingRegion": "us-west-2a"
+              },
+              "clientConfig": {
+                "protocol": "http",
+                "disableChunkedEncoding": true,
+                "enablePathStyleAccess": true,
+                "forceGlobalBucketAccessEnabled": false
+              },
+              "properties": {
+                "accessKeyId": {
+                  "type": "default",
+                  "password": "foo"
+                },
+                "secretAccessKey": {
+                  "type": "default",
+                  "password": "bar"
+                }
+              },
+            }
+          },
+          "inputFormat": {
+            "type": "parquet"
+          }
+        },
+...
+},
+```
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set the value to `iceberg`.|yes|
+|tableName|The iceberg table name configured in the catalog.|yes|
+|namespace|The iceberg namespace associated with the table|yes|
+|icebergFilter|JSON Object used to filter data files within a snapshot when reading|no|
+|icebergCatalog|JSON Object used to define the catalog that manages the configured iceberg table|yes|
+|warehouseSource|JSON Object used to indicate which native input source needs to be used to read the data files from the warehouse|yes|
+
+Catalog Object:
+
+There are two supported catalog types: `local` and `hive`
+
+Local catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `local`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+Hive Catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `hive`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogUri|The URI associated with the hive catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+IcebergFilter Object:
+
+This input source provides filters: `and` , `equals` , `interval` and `or`. These filters can be used to filter out data files from a snapshot, thereby reducing the number of files Druid has to ingest.
+
+`equals` Filter:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `equals`.|yes|
+|filterColumn|The column name from the iceberg table schema based on which filtering needs to happen|yes|
+|filterValue|The value to filter on|yes|
+
+`interval` Filter:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `interval`.|yes|
+|filterColumn|The column name from the iceberg table schema based on which filtering needs to happen|yes|
+|intervals|A JSON array containing ISO-8601 interval strings. This defines the time ranges to filter on. The start interval is inclusive and the end interval is exclusive. |yes|
+
+`and` Filter:

Review Comment:
   `and`, `or`, and `not` filters all have the same properties. Consider not using tables to present this information. 
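   For reference, a composite filter in the ingestion spec might look like the sketch below. The `filters` property name for the child filters is an assumption based on this thread, and the column names are placeholders:

   ```json
   "icebergFilter": {
     "type": "and",
     "filters": [
       {
         "type": "interval",
         "filterColumn": "event_time",
         "intervals": ["2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"]
       },
       {
         "type": "equals",
         "filterColumn": "region",
         "filterValue": "us-west-2"
       }
     ]
   }
   ```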





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "a2l007 (via GitHub)" <gi...@apache.org>.
a2l007 commented on PR #14329:
URL: https://github.com/apache/druid/pull/14329#issuecomment-1635032689

   @ektravel Thank you for your review; I've addressed most of your comments. I haven't code-formatted the input source properties since I'm following the same format as the other input sources described on that page. Let me know what you think.




Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "a2l007 (via GitHub)" <gi...@apache.org>.
a2l007 commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1255047919


##########
processing/src/main/java/org/apache/druid/data/input/AbstractInputSourceAdapter.java:
##########
@@ -0,0 +1,133 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.data.input;
+
+import com.fasterxml.jackson.annotation.JsonSubTypes;
+import com.fasterxml.jackson.annotation.JsonTypeInfo;
+import org.apache.druid.data.input.impl.LocalInputSourceAdapter;
+import org.apache.druid.data.input.impl.SplittableInputSource;
+import org.apache.druid.java.util.common.CloseableIterators;
+import org.apache.druid.java.util.common.ISE;
+import org.apache.druid.java.util.common.parsers.CloseableIterator;
+
+import javax.annotation.Nullable;
+import java.io.File;
+import java.io.IOException;
+import java.util.Collections;
+import java.util.List;
+import java.util.stream.Stream;
+
+/**
+ * A wrapper on top of {@link SplittableInputSource} that handles input source creation.
+ * For composing input sources such as IcebergInputSource, the delegate input source instantiation might fail upon deserialization since the input file paths
+ * are not available yet and this might fail the input source precondition checks.
+ * This adapter helps create the delegate input source once the input file paths are fully determined.
+ */
+@JsonTypeInfo(use = JsonTypeInfo.Id.NAME, property = "type")
+@JsonSubTypes(value = {
+    @JsonSubTypes.Type(name = LocalInputSourceAdapter.TYPE_KEY, value = LocalInputSourceAdapter.class)
+})
+public abstract class AbstractInputSourceAdapter
+{
+  private SplittableInputSource inputSource;
+
+  public abstract SplittableInputSource generateInputSource(List<String> inputFilePaths);
+
+  public void setupInputSource(List<String> inputFilePaths)
+  {
+    if (inputSource != null) {
+      throw new ISE("Inputsource is already initialized!");
+    }
+    if (inputFilePaths.isEmpty()) {
+      inputSource = new EmptyInputSource();
+    } else {
+      inputSource = generateInputSource(inputFilePaths);
+    }
+  }
+
+  public SplittableInputSource getInputSource()
+  {
+    if (inputSource == null) {
+      throw new ISE("Inputsource is not initialized yet!");
+    }
+    return inputSource;
+  }
+
+  private static class EmptyInputSource implements SplittableInputSource

Review Comment:
   Added docs, thanks.



##########
extensions-contrib/druid-iceberg-extensions/src/main/java/org/apache/druid/iceberg/input/IcebergCatalog.java:
##########
@@ -0,0 +1,104 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.iceberg.input;
+
+import com.fasterxml.jackson.annotation.JsonTypeInfo;
+import org.apache.druid.data.input.InputFormat;
+import org.apache.druid.iceberg.filter.IcebergFilter;
+import org.apache.druid.java.util.common.IAE;
+import org.apache.druid.java.util.common.RE;
+import org.apache.druid.java.util.common.logger.Logger;
+import org.apache.iceberg.BaseMetastoreCatalog;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.TableScan;
+import org.apache.iceberg.catalog.Catalog;
+import org.apache.iceberg.catalog.Namespace;
+import org.apache.iceberg.catalog.TableIdentifier;
+import org.apache.iceberg.io.CloseableIterable;
+
+import java.util.ArrayList;
+import java.util.List;
+
+/*
+ * Druid wrapper for an iceberg catalog.
+ * The configured catalog is used to load the specified iceberg table and retrieve the underlying live data files upto the latest snapshot.
+ * This does not perform any projections on the table yet, therefore all the underlying columns will be retrieved from the data files.
+ */
+
+@JsonTypeInfo(use = JsonTypeInfo.Id.NAME, property = InputFormat.TYPE_PROPERTY)
+public abstract class IcebergCatalog
+{
+  private static final Logger log = new Logger(IcebergCatalog.class);
+
+  public abstract BaseMetastoreCatalog retrieveCatalog();
+
+  /**
+   * Extract the iceberg data files upto the latest snapshot associated with the table

Review Comment:
   Yes, it gets the file paths from the scan tasks.



##########
extensions-contrib/druid-iceberg-extensions/src/main/java/org/apache/druid/iceberg/input/IcebergCatalog.java:
##########
@@ -0,0 +1,104 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.iceberg.input;
+
+import com.fasterxml.jackson.annotation.JsonTypeInfo;
+import org.apache.druid.data.input.InputFormat;
+import org.apache.druid.iceberg.filter.IcebergFilter;
+import org.apache.druid.java.util.common.IAE;
+import org.apache.druid.java.util.common.RE;
+import org.apache.druid.java.util.common.logger.Logger;
+import org.apache.iceberg.BaseMetastoreCatalog;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.TableScan;
+import org.apache.iceberg.catalog.Catalog;
+import org.apache.iceberg.catalog.Namespace;
+import org.apache.iceberg.catalog.TableIdentifier;
+import org.apache.iceberg.io.CloseableIterable;
+
+import java.util.ArrayList;
+import java.util.List;
+
+/*
+ * Druid wrapper for an iceberg catalog.
+ * The configured catalog is used to load the specified iceberg table and retrieve the underlying live data files upto the latest snapshot.
+ * This does not perform any projections on the table yet, therefore all the underlying columns will be retrieved from the data files.
+ */
+
+@JsonTypeInfo(use = JsonTypeInfo.Id.NAME, property = InputFormat.TYPE_PROPERTY)
+public abstract class IcebergCatalog
+{
+  private static final Logger log = new Logger(IcebergCatalog.class);
+
+  public abstract BaseMetastoreCatalog retrieveCatalog();
+
+  /**
+   * Extract the iceberg data files upto the latest snapshot associated with the table
+   *
+   * @param tableNamespace The catalog namespace under which the table is defined
+   * @param tableName      The iceberg table name
+   * @return a list of data file paths
+   */
+  public List<String> extractSnapshotDataFiles(
+      String tableNamespace,
+      String tableName,
+      IcebergFilter icebergFilter
+  )
+  {
+    Catalog catalog = retrieveCatalog();
+    Namespace namespace = Namespace.of(tableNamespace);
+    String tableIdentifier = tableNamespace + "." + tableName;
+
+    List<String> dataFilePaths = new ArrayList<>();
+
+    ClassLoader currCtxClassloader = Thread.currentThread().getContextClassLoader();
+    try {
+      Thread.currentThread().setContextClassLoader(getClass().getClassLoader());
+      TableIdentifier icebergTableIdentifier = catalog.listTables(namespace).stream()
+                                                      .filter(tableId -> tableId.toString().equals(tableIdentifier))
+                                                      .findFirst()
+                                                      .orElseThrow(() -> new IAE(
+                                                          " Couldn't retrieve table identifier for '%s'",

Review Comment:
   Updated the error message.





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "a2l007 (via GitHub)" <gi...@apache.org>.
a2l007 commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1255050026


##########
docs/ingestion/input-sources.md:
##########
@@ -794,6 +794,194 @@ The following is an example of a Combining input source spec:
 ...
 ```
 
+## Iceberg input source
+
+> You need to include the `druid-iceberg-extensions` as an extension to use the Iceberg input source.
+
+The Iceberg input source is used to read data stored in the Iceberg table format. For a given table, this input source scans up to the latest iceberg snapshot from the configured Hive catalog and the underlying live data files will be ingested using the existing input source formats available in Druid.
+
+The Iceberg input source cannot be independent as it relies on the existing input sources to perform the actual read from the Data files.
+For example, if the warehouse associated with an iceberg catalog is on `S3`, please ensure that the [`druid-s3-extensions`](../development/extensions-core/s3.md) extension is also loaded.

Review Comment:
   It would require the S3 extension to read the input paths after the IcebergInputSource fetches them.
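   As a sketch (extension names must match what is actually installed in the deployment), the load list in `common.runtime.properties` would need the Iceberg, S3, and Parquet extensions together:

   ```
   druid.extensions.loadList=["druid-iceberg-extensions", "druid-s3-extensions", "druid-parquet-extensions"]
   ```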





[GitHub] [druid] a2l007 commented on a diff in pull request #14329: Extension to read and ingest iceberg data files

Posted by "a2l007 (via GitHub)" <gi...@apache.org>.
a2l007 commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1225896931


##########
docs/development/extensions-contrib/iceberg.md:
##########
@@ -0,0 +1,120 @@
+---
+id: iceberg 
+title: "Iceberg"
+---
+
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+
+## Iceberg Ingest Extension
+
+This extension provides [IcebergInputSource](../../ingestion/input-sources.md#iceberg-input-source) which enables ingestion of data stored in the Iceberg table format into Druid.
+
+Apache Iceberg is an open table format for huge analytic datasets. Even though iceberg manages most of its metadata on metadata files, it is still dependent on a metastore for managing a certain amount of metadata.
+These metastores are defined as Iceberg catalogs and this extension supports connecting to the following catalog types:
+* Hive metastore catalog
+* Local catalog
+
+Support for AWS Glue and REST based catalogs are not available yet.
+
+For a given catalog, iceberg table name and filters, The IcebergInputSource works by reading the table from the catalog, applying the filters and extracting all the underlying live data files up to the latest snapshot.
+The data files are in either Parquet, ORC or Avro formats and all of these have InputFormat support in Druid. The data files typically reside in a warehouse location which could be in HDFS, S3 or the local filesystem.
+This extension relies on the existing InputSource connectors in Druid to read the data files from the warehouse. Therefore, the IcebergInputSource can be considered as an intermediate InputSource which provides the file paths for other InputSource implementations.
+
+### Load the Iceberg Ingest extension
+
+To use the iceberg extension, add the `druid-iceberg-extensions` to the list of loaded extensions. See [Loading extensions](../../configuration/extensions.md#loading-extensions) for more information.
+
+
+### Hive Metastore catalog
+
+For Druid to seamlessly talk to the Hive Metastore, ensure that the Hive specific configuration files such as `hive-site.xml` and `core-site.xml` are available in the Druid classpath.  

Review Comment:
   Yes, only on the peons; fixed it in the docs.





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260486647


##########
docs/development/extensions-contrib/iceberg.md:
##########
@@ -0,0 +1,121 @@
+---
+id: iceberg 
+title: "Iceberg"
+---
+
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+
+## Iceberg Ingest Extension
+
+This extension provides [IcebergInputSource](../../ingestion/input-sources.md#iceberg-input-source) which enables ingestion of data stored in the Iceberg table format into Druid.
+
+Apache Iceberg is an open table format for huge analytic datasets. Even though iceberg manages most of its metadata in metadata files in the object storage, it is still dependent on a metastore for managing a certain amount of metadata.
+These metastores are defined as Iceberg catalogs and this extension supports connecting to the following catalog types:
+* Hive metastore catalog
+* Local catalog
+
+Support for AWS Glue and REST based catalogs are not available yet.
+
+For a given catalog, iceberg table name and filters, the IcebergInputSource works by reading the table from the catalog, applying the filters and extracting all the underlying live data files up to the latest snapshot.
+The data files are in either Parquet, ORC or Avro formats and all of these have InputFormat support in Druid. The data files typically reside in a warehouse location which could be in HDFS, S3 or the local filesystem.
+This extension relies on the existing InputSource connectors in Druid to read the data files from the warehouse. Therefore, the IcebergInputSource can be considered as an intermediate InputSource which provides the file paths for other InputSource implementations.
+
+### Load the Iceberg Ingest extension
+
+To use the iceberg extension, add the `druid-iceberg-extensions` to the list of loaded extensions. See [Loading extensions](../../configuration/extensions.md#loading-extensions) for more information.
+
+
+### Hive Metastore catalog
+
+For Druid to seamlessly talk to the Hive Metastore, ensure that the Hive specific configuration files such as `hive-site.xml` and `core-site.xml` are available in the Druid classpath for peon processes.  
+Hive specific properties can also be specified under the `catalogProperties` object in the ingestion spec. 
+
+Hive metastore catalogs can be associated with different types of warehouses, but this extension presently only supports HDFS and S3 warehouse directories.
+
+#### Reading from HDFS warehouse 
+
+Ensure that the extension `druid-hdfs-storage` is loaded. The data file paths are extracted from the Hive metastore catalog and [HDFS input source](../../ingestion/input-sources.md#hdfs-input-source) is used to ingest these files.
+The `warehouseSource` type in the ingestion spec should be `hdfs`.
+
+If the Hive metastore supports Kerberos authentication, the following properties will be required in the `catalogProperties`:
+
+```json
+"catalogProperties": {
+  "principal": "krb_principal",
+  "keytab": "/path/to/keytab"
+}
+```
+Only Kerberos based authentication is supported as of now.
+
+#### Reading from S3 warehouse
+
+Ensure that the extension `druid-s3-extensions` is loaded. The data file paths are extracted from the Hive metastore catalog and `S3InputSource` is used to ingest these files.
+The `warehouseSource` type in the ingestion spec should be `s3`. If the S3 endpoint for the warehouse is different from the endpoint configured as the deep storage, the following properties are required in the `warehouseSource` section to define the S3 endpoint settings:
+
+```json
+"warehouseSource": {
+  "type": "s3",
+  "endpointConfig": {
+    "url": "S3_ENDPOINT_URL",
+    "signingRegion": "us-east-1"
+  },
+  "clientConfig": {
+    "protocol": "http",
+    "disableChunkedEncoding": true,
+    "enablePathStyleAccess": true,
+    "forceGlobalBucketAccessEnabled": false
+  },
+  "properties": {
+    "accessKeyId": {
+      "type": "default",
+      "password": "<ACCESS_KEY_ID"
+    },
+    "secretAccessKey": {
+      "type": "default",
+      "password": "<SECRET_ACCESS_KEY>"
+    }
+  }
+}
+```
+
+This extension uses the [Hadoop AWS module](https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/) to connect to S3 and retrieve the metadata and data file paths.
+The following properties will be required in the `catalogProperties`:
+
+```json
+"catalogProperties": {
+  "fs.s3a.access.key" : "S3_ACCESS_KEY",
+  "fs.s3a.secret.key" : "S3_SECRET_KEY",
+  "fs.s3a.endpoint" : "S3_API_ENDPOINT"
+}
+```
+Since the Hadoop AWS connector uses the `s3a` filesystem based client, the warehouse path should be specified with the `s3a://` protocol instead of `s3://`.

Review Comment:
   ```suggestion
   Since the Hadoop AWS connector uses the `s3a` filesystem client, specify the warehouse path with the `s3a://` protocol instead of `s3://`.
   ```
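   To make that concrete, a minimal sketch of the Hive catalog object with an `s3a` warehouse path (bucket, endpoint, and credentials are placeholders):

   ```json
   "icebergCatalog": {
     "type": "hive",
     "warehousePath": "s3a://warehouse-bucket/iceberg",
     "catalogUri": "thrift://hive-metastore.x.com:8970",
     "catalogProperties": {
       "fs.s3a.access.key": "S3_ACCESS_KEY",
       "fs.s3a.secret.key": "S3_SECRET_KEY",
       "fs.s3a.endpoint": "S3_API_ENDPOINT"
     }
   }
   ```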





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260544200


##########
docs/ingestion/input-sources.md:
##########
@@ -794,6 +794,194 @@ The following is an example of a Combining input source spec:
 ...
 ```
 
+## Iceberg input source
+
+> You need to include the `druid-iceberg-extensions` as an extension to use the Iceberg input source.
+
+The Iceberg input source is used to read data stored in the Iceberg table format. For a given table, this input source scans up to the latest iceberg snapshot from the configured Hive catalog and the underlying live data files will be ingested using the existing input source formats available in Druid.
+
+The Iceberg input source cannot be independent as it relies on the existing input sources to perform the actual read from the Data files.
+For example, if the warehouse associated with an iceberg catalog is on `S3`, please ensure that the [`druid-s3-extensions`](../development/extensions-core/s3.md) extension is also loaded.
+
+Sample specs:
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "iceberg",
+        "tableName": "iceberg_table",
+        "namespace": "iceberg_namespace",
+        "icebergCatalog": {
+            "type": "hive",
+            "warehousePath": "hdfs://warehouse/path",
+            "catalogUri": "thrift://hive-metastore.x.com:8970",
+            "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "catalog_test",
+                "hadoop.security.authentication": "kerberos",
+                "hadoop.security.authorization": "true"
+            }
+        },
+        "icebergFilter": {
+            "type": "interval",
+            "filterColumn": "event_time",
+            "intervals": [
+              "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+            ]
+        },
+        "warehouseSource": {
+            "type": "hdfs"
+        }
+      },
+      "inputFormat": {
+        "type": "parquet"
+      }
+  },
+      ...
+},
+...
+```
+
+```json
+...
+        "ioConfig": {
+          "type": "index_parallel",
+          "inputSource": {
+            "type": "iceberg",
+            "tableName": "iceberg_table",
+            "namespace": "iceberg_namespace",
+            "icebergCatalog": {
+              "type": "hive",
+              "warehousePath": "hdfs://warehouse/path",
+              "catalogUri": "thrift://hive-metastore.x.com:8970",
+              "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "default_catalog",
+                "fs.s3a.access.key" : "S3_ACCESS_KEY",
+                "fs.s3a.secret.key" : "S3_SECRET_KEY",
+                "fs.s3a.endpoint" : "S3_API_ENDPOINT"
+              }
+            },
+            "icebergFilter": {
+              "type": "interval",
+              "filterColumn": "event_time",
+              "intervals": [
+                "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+              ]
+            },
+            "warehouseSource": {
+              "type": "s3",
+              "endpointConfig": {
+                "url": "teststore.aws.com",
+                "signingRegion": "us-west-2a"
+              },
+              "clientConfig": {
+                "protocol": "http",
+                "disableChunkedEncoding": true,
+                "enablePathStyleAccess": true,
+                "forceGlobalBucketAccessEnabled": false
+              },
+              "properties": {
+                "accessKeyId": {
+                  "type": "default",
+                  "password": "foo"
+                },
+                "secretAccessKey": {
+                  "type": "default",
+                  "password": "bar"
+                }
+              },
+            }
+          },
+          "inputFormat": {
+            "type": "parquet"
+          }
+        },
+...
+},
+```
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set the value to `iceberg`.|yes|
+|tableName|The iceberg table name configured in the catalog.|yes|
+|namespace|The iceberg namespace associated with the table|yes|
+|icebergFilter|JSON Object used to filter data files within a snapshot when reading|no|
+|icebergCatalog|JSON Object used to define the catalog that manages the configured iceberg table|yes|
+|warehouseSource|JSON Object used to indicate which native input source needs to be used to read the data files from the warehouse|yes|
+
+Catalog Object:
+
+There are two supported catalog types: `local` and `hive`
+
+Local catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `local`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+Hive Catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `hive`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogUri|The URI associated with the hive catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+IcebergFilter Object:
+
+This input source provides filters: `and` , `equals` , `interval` and `or`. These filters can be used to filter out data files from a snapshot, thereby reducing the number of files Druid has to ingest.
+
+`equals` Filter:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `equals`.|yes|

Review Comment:
   ```suggestion
   |`type`|Set this value to `equals`.|Yes|
   ```



##########
docs/ingestion/input-sources.md:
##########
@@ -794,6 +794,194 @@ The following is an example of a Combining input source spec:
 ...
 ```
 
+## Iceberg input source
+
+> You need to include the `druid-iceberg-extensions` as an extension to use the Iceberg input source.
+
+The Iceberg input source is used to read data stored in the Iceberg table format. For a given table, this input source scans up to the latest iceberg snapshot from the configured Hive catalog and the underlying live data files will be ingested using the existing input source formats available in Druid.
+
+The Iceberg input source cannot be independent as it relies on the existing input sources to perform the actual read from the Data files.
+For example, if the warehouse associated with an iceberg catalog is on `S3`, please ensure that the [`druid-s3-extensions`](../development/extensions-core/s3.md) extension is also loaded.
+
+Sample specs:
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "iceberg",
+        "tableName": "iceberg_table",
+        "namespace": "iceberg_namespace",
+        "icebergCatalog": {
+            "type": "hive",
+            "warehousePath": "hdfs://warehouse/path",
+            "catalogUri": "thrift://hive-metastore.x.com:8970",
+            "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "catalog_test",
+                "hadoop.security.authentication": "kerberos",
+                "hadoop.security.authorization": "true"
+            }
+        },
+        "icebergFilter": {
+            "type": "interval",
+            "filterColumn": "event_time",
+            "intervals": [
+              "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+            ]
+        },
+        "warehouseSource": {
+            "type": "hdfs"
+        }
+      },
+      "inputFormat": {
+        "type": "parquet"
+      }
+  },
+      ...
+},
+...
+```
+
+```json
+...
+        "ioConfig": {
+          "type": "index_parallel",
+          "inputSource": {
+            "type": "iceberg",
+            "tableName": "iceberg_table",
+            "namespace": "iceberg_namespace",
+            "icebergCatalog": {
+              "type": "hive",
+              "warehousePath": "hdfs://warehouse/path",
+              "catalogUri": "thrift://hive-metastore.x.com:8970",
+              "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "default_catalog",
+                "fs.s3a.access.key" : "S3_ACCESS_KEY",
+                "fs.s3a.secret.key" : "S3_SECRET_KEY",
+                "fs.s3a.endpoint" : "S3_API_ENDPOINT"
+              }
+            },
+            "icebergFilter": {
+              "type": "interval",
+              "filterColumn": "event_time",
+              "intervals": [
+                "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+              ]
+            },
+            "warehouseSource": {
+              "type": "s3",
+              "endpointConfig": {
+                "url": "teststore.aws.com",
+                "signingRegion": "us-west-2a"
+              },
+              "clientConfig": {
+                "protocol": "http",
+                "disableChunkedEncoding": true,
+                "enablePathStyleAccess": true,
+                "forceGlobalBucketAccessEnabled": false
+              },
+              "properties": {
+                "accessKeyId": {
+                  "type": "default",
+                  "password": "foo"
+                },
+                "secretAccessKey": {
+                  "type": "default",
+                  "password": "bar"
+                }
+              },
+            }
+          },
+          "inputFormat": {
+            "type": "parquet"
+          }
+        },
+...
+},
+```
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set the value to `iceberg`.|yes|
+|tableName|The iceberg table name configured in the catalog.|yes|
+|namespace|The iceberg namespace associated with the table|yes|
+|icebergFilter|JSON Object used to filter data files within a snapshot when reading|no|
+|icebergCatalog|JSON Object used to define the catalog that manages the configured iceberg table|yes|
+|warehouseSource|JSON Object used to indicate which native input source needs to be used to read the data files from the warehouse|yes|
+
+Catalog Object:
+
+There are two supported catalog types: `local` and `hive`
+
+Local catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `local`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+Hive Catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `hive`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogUri|The URI associated with the hive catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+IcebergFilter Object:
+
+This input source provides filters: `and` , `equals` , `interval` and `or`. These filters can be used to filter out data files from a snapshot, thereby reducing the number of files Druid has to ingest.
+
+`equals` Filter:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `equals`.|yes|
+|filterColumn|The column name from the iceberg table schema based on which filtering needs to happen|yes|
+|filterValue|The value to filter on|yes|
+
+`interval` Filter:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `interval`.|yes|

Review Comment:
   ```suggestion
   |`type`|Set this value to `interval`.|Yes|
   ```



##########
docs/ingestion/input-sources.md:
##########
@@ -794,6 +794,194 @@ The following is an example of a Combining input source spec:
 ...
 ```
 
+## Iceberg input source
+
+> You need to include the `druid-iceberg-extensions` as an extension to use the Iceberg input source.
+
+The Iceberg input source is used to read data stored in the Iceberg table format. For a given table, this input source scans up to the latest iceberg snapshot from the configured Hive catalog and the underlying live data files will be ingested using the existing input source formats available in Druid.
+
+The Iceberg input source cannot be independent as it relies on the existing input sources to perform the actual read from the Data files.
+For example, if the warehouse associated with an iceberg catalog is on `S3`, please ensure that the [`druid-s3-extensions`](../development/extensions-core/s3.md) extension is also loaded.
+
+Sample specs:
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "iceberg",
+        "tableName": "iceberg_table",
+        "namespace": "iceberg_namespace",
+        "icebergCatalog": {
+            "type": "hive",
+            "warehousePath": "hdfs://warehouse/path",
+            "catalogUri": "thrift://hive-metastore.x.com:8970",
+            "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "catalog_test",
+                "hadoop.security.authentication": "kerberos",
+                "hadoop.security.authorization": "true"
+            }
+        },
+        "icebergFilter": {
+            "type": "interval",
+            "filterColumn": "event_time",
+            "intervals": [
+              "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+            ]
+        },
+        "warehouseSource": {
+            "type": "hdfs"
+        }
+      },
+      "inputFormat": {
+        "type": "parquet"
+      }
+  },
+      ...
+},
+...
+```
+
+```json
+...
+        "ioConfig": {
+          "type": "index_parallel",
+          "inputSource": {
+            "type": "iceberg",
+            "tableName": "iceberg_table",
+            "namespace": "iceberg_namespace",
+            "icebergCatalog": {
+              "type": "hive",
+              "warehousePath": "hdfs://warehouse/path",
+              "catalogUri": "thrift://hive-metastore.x.com:8970",
+              "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "default_catalog",
+                "fs.s3a.access.key" : "S3_ACCESS_KEY",
+                "fs.s3a.secret.key" : "S3_SECRET_KEY",
+                "fs.s3a.endpoint" : "S3_API_ENDPOINT"
+              }
+            },
+            "icebergFilter": {
+              "type": "interval",
+              "filterColumn": "event_time",
+              "intervals": [
+                "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+              ]
+            },
+            "warehouseSource": {
+              "type": "s3",
+              "endpointConfig": {
+                "url": "teststore.aws.com",
+                "signingRegion": "us-west-2a"
+              },
+              "clientConfig": {
+                "protocol": "http",
+                "disableChunkedEncoding": true,
+                "enablePathStyleAccess": true,
+                "forceGlobalBucketAccessEnabled": false
+              },
+              "properties": {
+                "accessKeyId": {
+                  "type": "default",
+                  "password": "foo"
+                },
+                "secretAccessKey": {
+                  "type": "default",
+                  "password": "bar"
+                }
+              },
+            }
+          },
+          "inputFormat": {
+            "type": "parquet"
+          }
+        },
+...
+},
+```
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set the value to `iceberg`.|yes|
+|tableName|The iceberg table name configured in the catalog.|yes|
+|namespace|The iceberg namespace associated with the table|yes|
+|icebergFilter|JSON Object used to filter data files within a snapshot when reading|no|
+|icebergCatalog|JSON Object used to define the catalog that manages the configured iceberg table|yes|
+|warehouseSource|JSON Object used to indicate which native input source needs to be used to read the data files from the warehouse|yes|
+
+Catalog Object:
+
+There are two supported catalog types: `local` and `hive`
+
+Local catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `local`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+Hive Catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `hive`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogUri|The URI associated with the hive catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+IcebergFilter Object:
+
+This input source provides filters: `and` , `equals` , `interval` and `or`. These filters can be used to filter out data files from a snapshot, thereby reducing the number of files Druid has to ingest.
+
+`equals` Filter:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `equals`.|yes|
+|filterColumn|The column name from the iceberg table schema based on which filtering needs to happen|yes|
+|filterValue|The value to filter on|yes|

Review Comment:
   ```suggestion
   |`filterValue`|The value to filter on.|Yes|
   ```
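
Putting the quoted `equals` filter table together, a filter of this type might be written as in the sketch below. The column name `region` and the value `us-east-1` are hypothetical and only illustrate the shape described by the table:

```json
"icebergFilter": {
  "type": "equals",
  "filterColumn": "region",
  "filterValue": "us-east-1"
}
```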



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260534188


##########
docs/ingestion/input-sources.md:
##########
@@ -794,6 +794,194 @@ The following is an example of a Combining input source spec:
 ...
 ```
 
+## Iceberg input source
+
+> You need to include the `druid-iceberg-extensions` as an extension to use the Iceberg input source.
+
+The Iceberg input source is used to read data stored in the Iceberg table format. For a given table, this input source scans up to the latest iceberg snapshot from the configured Hive catalog and the underlying live data files will be ingested using the existing input source formats available in Druid.
+
+The Iceberg input source cannot be independent as it relies on the existing input sources to perform the actual read from the Data files.
+For example, if the warehouse associated with an iceberg catalog is on `S3`, please ensure that the [`druid-s3-extensions`](../development/extensions-core/s3.md) extension is also loaded.
+
+Sample specs:
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "iceberg",
+        "tableName": "iceberg_table",
+        "namespace": "iceberg_namespace",
+        "icebergCatalog": {
+            "type": "hive",
+            "warehousePath": "hdfs://warehouse/path",
+            "catalogUri": "thrift://hive-metastore.x.com:8970",
+            "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "catalog_test",
+                "hadoop.security.authentication": "kerberos",
+                "hadoop.security.authorization": "true"
+            }
+        },
+        "icebergFilter": {
+            "type": "interval",
+            "filterColumn": "event_time",
+            "intervals": [
+              "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+            ]
+        },
+        "warehouseSource": {
+            "type": "hdfs"
+        }
+      },
+      "inputFormat": {
+        "type": "parquet"
+      }
+  },
+      ...
+},
+...
+```
+
+```json
+...
+        "ioConfig": {
+          "type": "index_parallel",
+          "inputSource": {
+            "type": "iceberg",
+            "tableName": "iceberg_table",
+            "namespace": "iceberg_namespace",
+            "icebergCatalog": {
+              "type": "hive",
+              "warehousePath": "hdfs://warehouse/path",
+              "catalogUri": "thrift://hive-metastore.x.com:8970",
+              "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "default_catalog",
+                "fs.s3a.access.key" : "S3_ACCESS_KEY",
+                "fs.s3a.secret.key" : "S3_SECRET_KEY",
+                "fs.s3a.endpoint" : "S3_API_ENDPOINT"
+              }
+            },
+            "icebergFilter": {
+              "type": "interval",
+              "filterColumn": "event_time",
+              "intervals": [
+                "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+              ]
+            },
+            "warehouseSource": {
+              "type": "s3",
+              "endpointConfig": {
+                "url": "teststore.aws.com",
+                "signingRegion": "us-west-2a"
+              },
+              "clientConfig": {
+                "protocol": "http",
+                "disableChunkedEncoding": true,
+                "enablePathStyleAccess": true,
+                "forceGlobalBucketAccessEnabled": false
+              },
+              "properties": {
+                "accessKeyId": {
+                  "type": "default",
+                  "password": "foo"
+                },
+                "secretAccessKey": {
+                  "type": "default",
+                  "password": "bar"
+                }
+              },
+            }
+          },
+          "inputFormat": {
+            "type": "parquet"
+          }
+        },
+...
+},
+```
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set the value to `iceberg`.|yes|
+|tableName|The iceberg table name configured in the catalog.|yes|
+|namespace|The iceberg namespace associated with the table|yes|
+|icebergFilter|JSON Object used to filter data files within a snapshot when reading|no|
+|icebergCatalog|JSON Object used to define the catalog that manages the configured iceberg table|yes|
+|warehouseSource|JSON Object used to indicate which native input source needs to be used to read the data files from the warehouse|yes|
+
+Catalog Object:
+
+There are two supported catalog types: `local` and `hive`
+
+Local catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `local`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+Hive Catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `hive`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogUri|The URI associated with the hive catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+IcebergFilter Object:
+
+This input source provides filters: `and` , `equals` , `interval` and `or`. These filters can be used to filter out data files from a snapshot, thereby reducing the number of files Druid has to ingest.
+
+`equals` Filter:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `equals`.|yes|
+|filterColumn|The column name from the iceberg table schema based on which filtering needs to happen|yes|
+|filterValue|The value to filter on|yes|
+
+`interval` Filter:

Review Comment:
   ```suggestion
   The `interval` filter:
   ```
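
Per the `interval` filter's property table in the PR's documentation, `intervals` takes a JSON array of ISO-8601 interval strings, so a single filter can cover several ranges. A minimal sketch, reusing the `event_time` column from the sample specs with hypothetical time ranges:

```json
"icebergFilter": {
  "type": "interval",
  "filterColumn": "event_time",
  "intervals": [
    "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z",
    "2023-05-12T00:00:00.000Z/2023-05-13T00:00:00.000Z"
  ]
}
```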




Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260504769


##########
docs/ingestion/input-sources.md:
##########
@@ -794,6 +794,194 @@ The following is an example of a Combining input source spec:
 ...
 ```
 
+## Iceberg input source
+
+> You need to include the `druid-iceberg-extensions` as an extension to use the Iceberg input source.
+
+The Iceberg input source is used to read data stored in the Iceberg table format. For a given table, this input source scans up to the latest iceberg snapshot from the configured Hive catalog and the underlying live data files will be ingested using the existing input source formats available in Druid.
+
+The Iceberg input source cannot be independent as it relies on the existing input sources to perform the actual read from the Data files.
+For example, if the warehouse associated with an iceberg catalog is on `S3`, please ensure that the [`druid-s3-extensions`](../development/extensions-core/s3.md) extension is also loaded.
+
+Sample specs:
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "iceberg",
+        "tableName": "iceberg_table",
+        "namespace": "iceberg_namespace",
+        "icebergCatalog": {
+            "type": "hive",
+            "warehousePath": "hdfs://warehouse/path",
+            "catalogUri": "thrift://hive-metastore.x.com:8970",
+            "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "catalog_test",
+                "hadoop.security.authentication": "kerberos",
+                "hadoop.security.authorization": "true"
+            }
+        },
+        "icebergFilter": {
+            "type": "interval",
+            "filterColumn": "event_time",
+            "intervals": [
+              "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+            ]
+        },
+        "warehouseSource": {
+            "type": "hdfs"
+        }
+      },
+      "inputFormat": {
+        "type": "parquet"
+      }
+  },
+      ...
+},
+...
+```
+
+```json
+...
+        "ioConfig": {
+          "type": "index_parallel",
+          "inputSource": {
+            "type": "iceberg",
+            "tableName": "iceberg_table",
+            "namespace": "iceberg_namespace",
+            "icebergCatalog": {
+              "type": "hive",
+              "warehousePath": "hdfs://warehouse/path",
+              "catalogUri": "thrift://hive-metastore.x.com:8970",
+              "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "default_catalog",
+                "fs.s3a.access.key" : "S3_ACCESS_KEY",
+                "fs.s3a.secret.key" : "S3_SECRET_KEY",
+                "fs.s3a.endpoint" : "S3_API_ENDPOINT"
+              }
+            },
+            "icebergFilter": {
+              "type": "interval",
+              "filterColumn": "event_time",
+              "intervals": [
+                "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+              ]
+            },
+            "warehouseSource": {
+              "type": "s3",
+              "endpointConfig": {
+                "url": "teststore.aws.com",
+                "signingRegion": "us-west-2a"
+              },
+              "clientConfig": {
+                "protocol": "http",
+                "disableChunkedEncoding": true,
+                "enablePathStyleAccess": true,
+                "forceGlobalBucketAccessEnabled": false
+              },
+              "properties": {
+                "accessKeyId": {
+                  "type": "default",
+                  "password": "foo"
+                },
+                "secretAccessKey": {
+                  "type": "default",
+                  "password": "bar"
+                }
+              },
+            }
+          },
+          "inputFormat": {
+            "type": "parquet"
+          }
+        },
+...
+},
+```
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set the value to `iceberg`.|yes|
+|tableName|The iceberg table name configured in the catalog.|yes|
+|namespace|The iceberg namespace associated with the table|yes|
+|icebergFilter|JSON Object used to filter data files within a snapshot when reading|no|
+|icebergCatalog|JSON Object used to define the catalog that manages the configured iceberg table|yes|

Review Comment:
   ```suggestion
   |`icebergCatalog`|The JSON object used to define the catalog that manages the configured Iceberg table.|Yes|
   ```




Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260503661


##########
docs/ingestion/input-sources.md:
##########
@@ -794,6 +794,194 @@ The following is an example of a Combining input source spec:
 ...
 ```
 
+## Iceberg input source
+
+> You need to include the `druid-iceberg-extensions` as an extension to use the Iceberg input source.
+
+The Iceberg input source is used to read data stored in the Iceberg table format. For a given table, this input source scans up to the latest iceberg snapshot from the configured Hive catalog and the underlying live data files will be ingested using the existing input source formats available in Druid.
+
+The Iceberg input source cannot be independent as it relies on the existing input sources to perform the actual read from the Data files.
+For example, if the warehouse associated with an iceberg catalog is on `S3`, please ensure that the [`druid-s3-extensions`](../development/extensions-core/s3.md) extension is also loaded.
+
+Sample specs:
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "iceberg",
+        "tableName": "iceberg_table",
+        "namespace": "iceberg_namespace",
+        "icebergCatalog": {
+            "type": "hive",
+            "warehousePath": "hdfs://warehouse/path",
+            "catalogUri": "thrift://hive-metastore.x.com:8970",
+            "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "catalog_test",
+                "hadoop.security.authentication": "kerberos",
+                "hadoop.security.authorization": "true"
+            }
+        },
+        "icebergFilter": {
+            "type": "interval",
+            "filterColumn": "event_time",
+            "intervals": [
+              "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+            ]
+        },
+        "warehouseSource": {
+            "type": "hdfs"
+        }
+      },
+      "inputFormat": {
+        "type": "parquet"
+      }
+  },
+      ...
+},
+...
+```
+
+```json
+...
+        "ioConfig": {
+          "type": "index_parallel",
+          "inputSource": {
+            "type": "iceberg",
+            "tableName": "iceberg_table",
+            "namespace": "iceberg_namespace",
+            "icebergCatalog": {
+              "type": "hive",
+              "warehousePath": "hdfs://warehouse/path",
+              "catalogUri": "thrift://hive-metastore.x.com:8970",
+              "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "default_catalog",
+                "fs.s3a.access.key" : "S3_ACCESS_KEY",
+                "fs.s3a.secret.key" : "S3_SECRET_KEY",
+                "fs.s3a.endpoint" : "S3_API_ENDPOINT"
+              }
+            },
+            "icebergFilter": {
+              "type": "interval",
+              "filterColumn": "event_time",
+              "intervals": [
+                "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+              ]
+            },
+            "warehouseSource": {
+              "type": "s3",
+              "endpointConfig": {
+                "url": "teststore.aws.com",
+                "signingRegion": "us-west-2a"
+              },
+              "clientConfig": {
+                "protocol": "http",
+                "disableChunkedEncoding": true,
+                "enablePathStyleAccess": true,
+                "forceGlobalBucketAccessEnabled": false
+              },
+              "properties": {
+                "accessKeyId": {
+                  "type": "default",
+                  "password": "foo"
+                },
+                "secretAccessKey": {
+                  "type": "default",
+                  "password": "bar"
+                }
+              },
+            }
+          },
+          "inputFormat": {
+            "type": "parquet"
+          }
+        },
+...
+},
+```
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set the value to `iceberg`.|yes|
+|tableName|The iceberg table name configured in the catalog.|yes|

Review Comment:
   ```suggestion
   |`tableName`|The Iceberg table name configured in the catalog.|Yes|
   ```




Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260526682


##########
docs/ingestion/input-sources.md:
##########
@@ -794,6 +794,194 @@ The following is an example of a Combining input source spec:
 ...
 ```
 
+## Iceberg input source
+
+> You need to include the `druid-iceberg-extensions` as an extension to use the Iceberg input source.
+
+The Iceberg input source is used to read data stored in the Iceberg table format. For a given table, this input source scans up to the latest iceberg snapshot from the configured Hive catalog and the underlying live data files will be ingested using the existing input source formats available in Druid.
+
+The Iceberg input source cannot be independent as it relies on the existing input sources to perform the actual read from the Data files.
+For example, if the warehouse associated with an iceberg catalog is on `S3`, please ensure that the [`druid-s3-extensions`](../development/extensions-core/s3.md) extension is also loaded.
+
+Sample specs:
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "iceberg",
+        "tableName": "iceberg_table",
+        "namespace": "iceberg_namespace",
+        "icebergCatalog": {
+            "type": "hive",
+            "warehousePath": "hdfs://warehouse/path",
+            "catalogUri": "thrift://hive-metastore.x.com:8970",
+            "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "catalog_test",
+                "hadoop.security.authentication": "kerberos",
+                "hadoop.security.authorization": "true"
+            }
+        },
+        "icebergFilter": {
+            "type": "interval",
+            "filterColumn": "event_time",
+            "intervals": [
+              "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+            ]
+        },
+        "warehouseSource": {
+            "type": "hdfs"
+        }
+      },
+      "inputFormat": {
+        "type": "parquet"
+      }
+  },
+      ...
+},
+...
+```
+
+```json
+...
+        "ioConfig": {
+          "type": "index_parallel",
+          "inputSource": {
+            "type": "iceberg",
+            "tableName": "iceberg_table",
+            "namespace": "iceberg_namespace",
+            "icebergCatalog": {
+              "type": "hive",
+              "warehousePath": "hdfs://warehouse/path",
+              "catalogUri": "thrift://hive-metastore.x.com:8970",
+              "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "default_catalog",
+                "fs.s3a.access.key" : "S3_ACCESS_KEY",
+                "fs.s3a.secret.key" : "S3_SECRET_KEY",
+                "fs.s3a.endpoint" : "S3_API_ENDPOINT"
+              }
+            },
+            "icebergFilter": {
+              "type": "interval",
+              "filterColumn": "event_time",
+              "intervals": [
+                "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+              ]
+            },
+            "warehouseSource": {
+              "type": "s3",
+              "endpointConfig": {
+                "url": "teststore.aws.com",
+                "signingRegion": "us-west-2a"
+              },
+              "clientConfig": {
+                "protocol": "http",
+                "disableChunkedEncoding": true,
+                "enablePathStyleAccess": true,
+                "forceGlobalBucketAccessEnabled": false
+              },
+              "properties": {
+                "accessKeyId": {
+                  "type": "default",
+                  "password": "foo"
+                },
+                "secretAccessKey": {
+                  "type": "default",
+                  "password": "bar"
+                }
+              },
+            }
+          },
+          "inputFormat": {
+            "type": "parquet"
+          }
+        },
+...
+},
+```
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set the value to `iceberg`.|yes|
+|tableName|The iceberg table name configured in the catalog.|yes|
+|namespace|The iceberg namespace associated with the table|yes|
+|icebergFilter|JSON Object used to filter data files within a snapshot when reading|no|
+|icebergCatalog|JSON Object used to define the catalog that manages the configured iceberg table|yes|
+|warehouseSource|JSON Object used to indicate which native input source needs to be used to read the data files from the warehouse|yes|
+
+Catalog Object:
+
+There are two supported catalog types: `local` and `hive`
+
+Local catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `local`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+Hive Catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `hive`.|yes|

Review Comment:
   ```suggestion
   |`type`|Set this value to `hive`.|Yes|
   ```




Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260499243


##########
docs/ingestion/input-sources.md:
##########
@@ -794,6 +794,194 @@ The following is an example of a Combining input source spec:
 ...
 ```
 
+## Iceberg input source
+
+> You need to include the `druid-iceberg-extensions` as an extension to use the Iceberg input source.

Review Comment:
   ```suggestion
   > To use the Iceberg input source, add the `druid-iceberg-extensions` extension.
   ```




Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260538538


##########
docs/ingestion/input-sources.md:
##########
@@ -794,6 +794,194 @@ The following is an example of a Combining input source spec:
 ...
 ```
 
+## Iceberg input source
+
+> You need to include the `druid-iceberg-extensions` as an extension to use the Iceberg input source.
+
+The Iceberg input source is used to read data stored in the Iceberg table format. For a given table, this input source scans up to the latest iceberg snapshot from the configured Hive catalog and the underlying live data files will be ingested using the existing input source formats available in Druid.
+
+The Iceberg input source cannot be independent as it relies on the existing input sources to perform the actual read from the Data files.
+For example, if the warehouse associated with an iceberg catalog is on `S3`, please ensure that the [`druid-s3-extensions`](../development/extensions-core/s3.md) extension is also loaded.
+
+Sample specs:
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "iceberg",
+        "tableName": "iceberg_table",
+        "namespace": "iceberg_namespace",
+        "icebergCatalog": {
+            "type": "hive",
+            "warehousePath": "hdfs://warehouse/path",
+            "catalogUri": "thrift://hive-metastore.x.com:8970",
+            "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "catalog_test",
+                "hadoop.security.authentication": "kerberos",
+                "hadoop.security.authorization": "true"
+            }
+        },
+        "icebergFilter": {
+            "type": "interval",
+            "filterColumn": "event_time",
+            "intervals": [
+              "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+            ]
+        },
+        "warehouseSource": {
+            "type": "hdfs"
+        }
+      },
+      "inputFormat": {
+        "type": "parquet"
+      }
+  },
+      ...
+},
+...
+```
+
+```json
+...
+        "ioConfig": {
+          "type": "index_parallel",
+          "inputSource": {
+            "type": "iceberg",
+            "tableName": "iceberg_table",
+            "namespace": "iceberg_namespace",
+            "icebergCatalog": {
+              "type": "hive",
+              "warehousePath": "hdfs://warehouse/path",
+              "catalogUri": "thrift://hive-metastore.x.com:8970",
+              "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "default_catalog",
+                "fs.s3a.access.key" : "S3_ACCESS_KEY",
+                "fs.s3a.secret.key" : "S3_SECRET_KEY",
+                "fs.s3a.endpoint" : "S3_API_ENDPOINT"
+              }
+            },
+            "icebergFilter": {
+              "type": "interval",
+              "filterColumn": "event_time",
+              "intervals": [
+                "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+              ]
+            },
+            "warehouseSource": {
+              "type": "s3",
+              "endpointConfig": {
+                "url": "teststore.aws.com",
+                "signingRegion": "us-west-2a"
+              },
+              "clientConfig": {
+                "protocol": "http",
+                "disableChunkedEncoding": true,
+                "enablePathStyleAccess": true,
+                "forceGlobalBucketAccessEnabled": false
+              },
+              "properties": {
+                "accessKeyId": {
+                  "type": "default",
+                  "password": "foo"
+                },
+                "secretAccessKey": {
+                  "type": "default",
+                  "password": "bar"
+                }
+              },
+            }
+          },
+          "inputFormat": {
+            "type": "parquet"
+          }
+        },
+...
+},
+```
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set the value to `iceberg`.|yes|
+|tableName|The iceberg table name configured in the catalog.|yes|
+|namespace|The iceberg namespace associated with the table|yes|
+|icebergFilter|JSON Object used to filter data files within a snapshot when reading|no|
+|icebergCatalog|JSON Object used to define the catalog that manages the configured iceberg table|yes|
+|warehouseSource|JSON Object used to indicate which native input source needs to be used to read the data files from the warehouse|yes|
+
+Catalog Object:
+
+There are two supported catalog types: `local` and `hive`
+
+Local catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `local`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+Hive Catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `hive`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogUri|The URI associated with the hive catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+IcebergFilter Object:
+
+This input source provides filters: `and` , `equals` , `interval` and `or`. These filters can be used to filter out data files from a snapshot, thereby reducing the number of files Druid has to ingest.
+
+`equals` Filter:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `equals`.|yes|
+|filterColumn|The column name from the iceberg table schema based on which filtering needs to happen|yes|
+|filterValue|The value to filter on|yes|
+
+`interval` Filter:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `interval`.|yes|
+|filterColumn|The column name from the iceberg table schema based on which filtering needs to happen|yes|
+|intervals|A JSON array containing ISO-8601 interval strings. This defines the time ranges to filter on. The start interval is inclusive and the end interval is exclusive. |yes|
+
+`and` Filter:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `and`.|yes|
+|filters|List of iceberg filters that needs to be AND-ed|yes|
+
+`or` Filter:

Review Comment:
   ```suggestion
   The `or` filter:
   ```
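
The `and` filter described in the quoted table takes a `filters` list of other Iceberg filters, and the `or` filter is expected to follow the same shape. As an illustration only (the `region` column and its value are placeholders; `event_time` comes from the sample specs), a composed filter might look like this:

```json
"icebergFilter": {
  "type": "and",
  "filters": [
    {
      "type": "interval",
      "filterColumn": "event_time",
      "intervals": [
        "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
      ]
    },
    {
      "type": "equals",
      "filterColumn": "region",
      "filterValue": "us-east-1"
    }
  ]
}
```

An `or` filter would presumably take the same shape with `"type": "or"`.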




Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260530944


##########
docs/ingestion/input-sources.md:
##########
@@ -794,6 +794,194 @@ The following is an example of a Combining input source spec:
 ...
 ```
 
+## Iceberg input source
+
+> You need to include the `druid-iceberg-extensions` as an extension to use the Iceberg input source.
+
+The Iceberg input source is used to read data stored in the Iceberg table format. For a given table, this input source scans up to the latest iceberg snapshot from the configured Hive catalog and the underlying live data files will be ingested using the existing input source formats available in Druid.
+
+The Iceberg input source cannot be independent as it relies on the existing input sources to perform the actual read from the Data files.
+For example, if the warehouse associated with an iceberg catalog is on `S3`, please ensure that the [`druid-s3-extensions`](../development/extensions-core/s3.md) extension is also loaded.
+
+Sample specs:
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "iceberg",
+        "tableName": "iceberg_table",
+        "namespace": "iceberg_namespace",
+        "icebergCatalog": {
+            "type": "hive",
+            "warehousePath": "hdfs://warehouse/path",
+            "catalogUri": "thrift://hive-metastore.x.com:8970",
+            "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "catalog_test",
+                "hadoop.security.authentication": "kerberos",
+                "hadoop.security.authorization": "true"
+            }
+        },
+        "icebergFilter": {
+            "type": "interval",
+            "filterColumn": "event_time",
+            "intervals": [
+              "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+            ]
+        },
+        "warehouseSource": {
+            "type": "hdfs"
+        }
+      },
+      "inputFormat": {
+        "type": "parquet"
+      }
+  },
+      ...
+},
+...
+```
+
+```json
+...
+        "ioConfig": {
+          "type": "index_parallel",
+          "inputSource": {
+            "type": "iceberg",
+            "tableName": "iceberg_table",
+            "namespace": "iceberg_namespace",
+            "icebergCatalog": {
+              "type": "hive",
+              "warehousePath": "hdfs://warehouse/path",
+              "catalogUri": "thrift://hive-metastore.x.com:8970",
+              "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "default_catalog",
+                "fs.s3a.access.key" : "S3_ACCESS_KEY",
+                "fs.s3a.secret.key" : "S3_SECRET_KEY",
+                "fs.s3a.endpoint" : "S3_API_ENDPOINT"
+              }
+            },
+            "icebergFilter": {
+              "type": "interval",
+              "filterColumn": "event_time",
+              "intervals": [
+                "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+              ]
+            },
+            "warehouseSource": {
+              "type": "s3",
+              "endpointConfig": {
+                "url": "teststore.aws.com",
+                "signingRegion": "us-west-2a"
+              },
+              "clientConfig": {
+                "protocol": "http",
+                "disableChunkedEncoding": true,
+                "enablePathStyleAccess": true,
+                "forceGlobalBucketAccessEnabled": false
+              },
+              "properties": {
+                "accessKeyId": {
+                  "type": "default",
+                  "password": "foo"
+                },
+                "secretAccessKey": {
+                  "type": "default",
+                  "password": "bar"
+                }
+              },
+            }
+          },
+          "inputFormat": {
+            "type": "parquet"
+          }
+        },
+...
+},
+```
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set the value to `iceberg`.|yes|
+|tableName|The iceberg table name configured in the catalog.|yes|
+|namespace|The iceberg namespace associated with the table|yes|
+|icebergFilter|JSON Object used to filter data files within a snapshot when reading|no|
+|icebergCatalog|JSON Object used to define the catalog that manages the configured iceberg table|yes|
+|warehouseSource|JSON Object used to indicate which native input source needs to be used to read the data files from the warehouse|yes|
+
+Catalog Object:
+
+There are two supported catalog types: `local` and `hive`
+
+Local catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `local`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+Hive Catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `hive`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogUri|The URI associated with the hive catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+IcebergFilter Object:
+
+This input source provides filters: `and` , `equals` , `interval` and `or`. These filters can be used to filter out data files from a snapshot, thereby reducing the number of files Druid has to ingest.
+
+`equals` Filter:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `equals`.|yes|
+|filterColumn|The column name from the iceberg table schema based on which filtering needs to happen|yes|

Review Comment:
   ```suggestion
   |`filterColumn`|The name of the column from the Iceberg table schema to filter on.|Yes|
   ```




Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "abhishekagarwal87 (via GitHub)" <gi...@apache.org>.
abhishekagarwal87 commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1257886376


##########
extensions-contrib/druid-iceberg-extensions/pom.xml:
##########
@@ -0,0 +1,314 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+<project xmlns="http://maven.apache.org/POM/4.0.0"
+         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
+         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
+
+  <groupId>org.apache.druid.extensions</groupId>
+  <artifactId>druid-iceberg-extensions</artifactId>
+  <name>druid-iceberg-extensions</name>
+  <description>druid-iceberg-extensions</description>
+
+
+  <parent>
+    <artifactId>druid</artifactId>
+    <groupId>org.apache.druid</groupId>
+    <version>27.0.0-SNAPSHOT</version>
+    <relativePath>../../pom.xml</relativePath>
+  </parent>
+  <modelVersion>4.0.0</modelVersion>
+
+  <properties>
+  </properties>
+  <dependencies>
+    <dependency>

Review Comment:
   I think it's fine since we will remove Hadoop 2 support very soon anyway.



##########
extensions-contrib/druid-iceberg-extensions/src/main/java/org/apache/druid/iceberg/filter/IcebergEqualsFilter.java:
##########
@@ -0,0 +1,65 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.iceberg.filter;
+
+import com.fasterxml.jackson.annotation.JsonCreator;
+import com.fasterxml.jackson.annotation.JsonProperty;
+import com.google.common.base.Preconditions;
+import org.apache.iceberg.TableScan;
+import org.apache.iceberg.expressions.Expression;
+import org.apache.iceberg.expressions.Expressions;
+
+
+public class IcebergEqualsFilter implements IcebergFilter
+{
+
+  @JsonProperty
+  private final String filterColumn;
+
+  @JsonProperty
+  private final String filterValue;
+
+  @JsonCreator
+  public IcebergEqualsFilter(
+      @JsonProperty("filterColumn") String filterColumn,
+      @JsonProperty("filterValue") String filterValue
+  )
+  {
+    Preconditions.checkNotNull(filterColumn, "filterColumn can not be null");

Review Comment:
   Can the error message be adjusted to match what you did in IcebergIntervalFilter?



##########
extensions-contrib/druid-iceberg-extensions/pom.xml:
##########
@@ -0,0 +1,313 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+<project xmlns="http://maven.apache.org/POM/4.0.0"
+         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
+         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
+
+  <groupId>org.apache.druid.extensions</groupId>
+  <artifactId>druid-iceberg-extensions</artifactId>
+  <name>druid-iceberg-extensions</name>
+  <description>druid-iceberg-extensions</description>
+
+  <parent>
+    <artifactId>druid</artifactId>
+    <groupId>org.apache.druid</groupId>
+    <version>27.0.0-SNAPSHOT</version>
+    <relativePath>../../pom.xml</relativePath>
+  </parent>
+  <modelVersion>4.0.0</modelVersion>
+
+  <properties>
+  </properties>
+  <dependencies>
+    <dependency>
+      <groupId>org.apache.hadoop</groupId>
+      <artifactId>hadoop-common</artifactId>
+      <version>${hadoop.compile.version}</version>
+      <exclusions>
+        <exclusion>
+          <groupId>io.netty</groupId>
+          <artifactId>netty-buffer</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>commons-cli</groupId>
+          <artifactId>commons-cli</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>log4j</groupId>
+          <artifactId>log4j</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>commons-codec</groupId>
+          <artifactId>commons-codec</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>commons-logging</groupId>
+          <artifactId>commons-logging</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>commons-io</groupId>
+          <artifactId>commons-io</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>commons-lang</groupId>
+          <artifactId>commons-lang</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>org.apache.httpcomponents</groupId>
+          <artifactId>httpclient</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>org.apache.httpcomponents</groupId>
+          <artifactId>httpcore</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>org.apache.zookeeper</groupId>
+          <artifactId>zookeeper</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>org.slf4j</groupId>
+          <artifactId>slf4j-api</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>org.slf4j</groupId>
+          <artifactId>slf4j-log4j12</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>javax.ws.rs</groupId>
+          <artifactId>jsr311-api</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>com.google.code.findbugs</groupId>
+          <artifactId>jsr305</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>org.mortbay.jetty</groupId>
+          <artifactId>jetty-util</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>com.google.protobuf</groupId>
+          <artifactId>protobuf-java</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>com.sun.jersey</groupId>
+          <artifactId>jersey-core</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>org.apache.curator</groupId>
+          <artifactId>curator-client</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>org.apache.commons</groupId>
+          <artifactId>commons-math3</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>com.google.guava</groupId>
+          <artifactId>guava</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>org.apache.avro</groupId>
+          <artifactId>avro</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>net.java.dev.jets3t</groupId>
+          <artifactId>jets3t</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>com.sun.jersey</groupId>
+          <artifactId>jersey-json</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>com.jcraft</groupId>
+          <artifactId>jsch</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>org.mortbay.jetty</groupId>
+          <artifactId>jetty</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>com.sun.jersey</groupId>
+          <artifactId>jersey-server</artifactId>
+        </exclusion>
+        <!-- Following are excluded to remove security vulnerabilities: -->
+        <exclusion>
+          <groupId>commons-beanutils</groupId>
+          <artifactId>commons-beanutils-core</artifactId>
+        </exclusion>
+      </exclusions>
+    </dependency>
+
+    <dependency>
+      <groupId>org.apache.iceberg</groupId>
+      <artifactId>iceberg-spark-runtime-3.3_2.12</artifactId>
+      <version>1.0.0</version>
+      <exclusions>
+        <exclusion>
+          <groupId>com.fasterxml.jackson.core</groupId>
+          <artifactId>jackson-databind</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>com.fasterxml.jackson.core</groupId>
+          <artifactId>jackson-core</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>com.fasterxml.jackson.core</groupId>
+          <artifactId>jackson-annotations</artifactId>
+        </exclusion>
+      </exclusions>
+    </dependency>
+
+    <dependency>
+      <groupId>org.apache.hive</groupId>
+      <artifactId>hive-metastore</artifactId>
+      <version>3.1.3</version>
+      <exclusions>
+        <exclusion>
+          <groupId>com.fasterxml.jackson.core</groupId>
+          <artifactId>jackson-databind</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>com.fasterxml.jackson.core</groupId>
+          <artifactId>jackson-core</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>com.fasterxml.jackson.core</groupId>
+          <artifactId>jackson-annotations</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>org.apache.hadoop</groupId>
+          <artifactId>hadoop-hdfs</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>org.apache.hbase</groupId>
+          <artifactId>hbase-client</artifactId>
+        </exclusion>
+      </exclusions>
+    </dependency>
+
+    <dependency>
+      <groupId>org.apache.druid</groupId>
+      <artifactId>druid-processing</artifactId>
+      <version>${project.parent.version}</version>
+      <scope>provided</scope>
+      <exclusions>
+        <exclusion>
+          <groupId>org.slf4j</groupId>
+          <artifactId>slf4j-api</artifactId>
+        </exclusion>
+      </exclusions>
+    </dependency>
+    <dependency>
+      <groupId>com.fasterxml.jackson.core</groupId>
+      <artifactId>jackson-annotations</artifactId>
+      <scope>provided</scope>
+    </dependency>
+    <dependency>
+      <groupId>com.fasterxml.jackson.core</groupId>
+      <artifactId>jackson-databind</artifactId>
+      <scope>provided</scope>
+    </dependency>
+
+    <dependency>
+      <groupId>org.apache.hadoop</groupId>
+      <artifactId>hadoop-hdfs-client</artifactId>
+      <scope>runtime</scope>
+    </dependency>
+
+    <dependency>
+      <groupId>org.apache.hadoop</groupId>
+      <artifactId>hadoop-mapreduce-client-core</artifactId>

Review Comment:
   I started a discussion on the #dev channel. It would be preferable to use the shaded jars to avoid dependency conflicts in the future. Is this jar (hadoop-mapreduce-client-core) shaded? If you are not seeing any conflicts, it's fine for now.



##########
extensions-contrib/druid-iceberg-extensions/src/main/java/org/apache/druid/iceberg/filter/IcebergIntervalFilter.java:
##########
@@ -0,0 +1,90 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.iceberg.filter;
+
+import com.fasterxml.jackson.annotation.JsonCreator;
+import com.fasterxml.jackson.annotation.JsonProperty;
+import com.google.common.base.Preconditions;
+import org.apache.iceberg.TableScan;
+import org.apache.iceberg.expressions.Expression;
+import org.apache.iceberg.expressions.Expressions;
+import org.apache.iceberg.expressions.Literal;
+import org.apache.iceberg.types.Types;
+import org.joda.time.Interval;
+
+import java.util.ArrayList;
+import java.util.List;
+
+public class IcebergIntervalFilter implements IcebergFilter
+{
+  @JsonProperty
+  private final String filterColumn;
+
+  @JsonProperty
+  private final List<Interval> intervals;
+
+  @JsonCreator
+  public IcebergIntervalFilter(
+      @JsonProperty("filterColumn") String filterColumn,
+      @JsonProperty("intervals") List<Interval> intervals
+  )
+  {
+    Preconditions.checkNotNull(filterColumn, "filterColumn can not be null");
+    Preconditions.checkNotNull(intervals, "intervals can not be null");
+    this.filterColumn = filterColumn;
+    this.intervals = intervals;
+  }
+
+  @Override
+  public TableScan filter(TableScan tableScan)
+  {
+    return tableScan.filter(getFilterExpression());
+  }
+
+  @Override
+  public Expression getFilterExpression()
+  {
+    List<Expression> expressions = new ArrayList<>();
+    for (Interval filterInterval : intervals) {
+      Long dateStart = (long) Literal.of(filterInterval.getStart().toString())
+                                     .to(Types.TimestampType.withZone())
+                                     .value();
+      Long dateEnd = (long) Literal.of(filterInterval.getEnd().toString())
+                                   .to(Types.TimestampType.withZone())
+                                   .value();
+
+      expressions.add(Expressions.and(
+          Expressions.greaterThanOrEqual(
+              filterColumn,
+              dateStart
+          ),
+          Expressions.lessThan(
+              filterColumn,
+              dateEnd
+          )
+      ));
+    }
+    Expression finalExpr = Expressions.alwaysFalse();

Review Comment:
   Nah, it's probably fine.
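   
   For context, a hedged sketch of how the per-interval expressions would typically be OR-ed together from this starting point. This is an assumed continuation for illustration only, not necessarily the PR's exact code, since the hunk is truncated here:
   
   ```java
   // Assumed continuation for illustration only; the actual code is cut off in the hunk above.
   Expression finalExpr = Expressions.alwaysFalse();
   for (Expression filterExpr : expressions) {
     // alwaysFalse() is the identity for OR, so an empty interval list yields a filter
     // that matches no data files, and each interval expression widens the match.
     finalExpr = Expressions.or(finalExpr, filterExpr);
   }
   return finalExpr;
   ```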



##########
extensions-contrib/druid-iceberg-extensions/src/main/java/org/apache/druid/iceberg/filter/IcebergIntervalFilter.java:
##########
@@ -0,0 +1,90 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.iceberg.filter;
+
+import com.fasterxml.jackson.annotation.JsonCreator;
+import com.fasterxml.jackson.annotation.JsonProperty;
+import com.google.common.base.Preconditions;
+import org.apache.iceberg.TableScan;
+import org.apache.iceberg.expressions.Expression;
+import org.apache.iceberg.expressions.Expressions;
+import org.apache.iceberg.expressions.Literal;
+import org.apache.iceberg.types.Types;
+import org.joda.time.Interval;
+
+import java.util.ArrayList;
+import java.util.List;
+
+public class IcebergIntervalFilter implements IcebergFilter
+{
+  @JsonProperty
+  private final String filterColumn;
+
+  @JsonProperty
+  private final List<Interval> intervals;
+
+  @JsonCreator
+  public IcebergIntervalFilter(
+      @JsonProperty("filterColumn") String filterColumn,
+      @JsonProperty("intervals") List<Interval> intervals
+  )
+  {
+    Preconditions.checkNotNull(filterColumn, "filterColumn can not be null");
+    Preconditions.checkNotNull(intervals, "intervals can not be null");
+    this.filterColumn = filterColumn;
+    this.intervals = intervals;
+  }
+
+  @Override
+  public TableScan filter(TableScan tableScan)
+  {
+    return tableScan.filter(getFilterExpression());
+  }
+
+  @Override
+  public Expression getFilterExpression()
+  {
+    List<Expression> expressions = new ArrayList<>();
+    for (Interval filterInterval : intervals) {
+      Long dateStart = (long) Literal.of(filterInterval.getStart().toString())
+                                     .to(Types.TimestampType.withZone())

Review Comment:
   Can you please add this bit as a doc comment here?
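   
   For reference, a hedged sketch of the kind of comment that could document this conversion. The phrasing below is an assumption; the underlying point is that Iceberg timestamp literals carry microseconds since the epoch:
   
   ```java
   // Illustrative doc comment only; exact wording is up to the author.
   // Iceberg stores timestamp columns with microsecond precision, so each ISO-8601 interval
   // endpoint is converted to a TimestampType literal whose value() is the number of
   // microseconds since the epoch. These long values are then compared against filterColumn.
   Long dateStart = (long) Literal.of(filterInterval.getStart().toString())
                                  .to(Types.TimestampType.withZone())
                                  .value();
   ```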





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260342485


##########
docs/development/extensions-contrib/iceberg.md:
##########
@@ -0,0 +1,121 @@
+---
+id: iceberg 
+title: "Iceberg"
+---
+
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+
+## Iceberg Ingest Extension

Review Comment:
   I don't think this heading is necessary. If you delete this heading, you can move the other headings up a level.
   
   Since the topic is about Iceberg ingestion, consider introducing the feature first and then talking about the extension as a means of enabling the feature. For example:
   
   [Apache Iceberg](https://iceberg.apache.org/docs/latest/) is an open table format for huge analytic datasets. 
   [Iceberg input source](../../ingestion/input-sources.md#iceberg-input-source) lets you ingest data stored in the Iceberg table format into Apache Druid. To enable the Iceberg input source, add `druid-iceberg-extensions` to the list of extensions. See [Loading extensions](../../configuration/extensions.md#loading-extensions) for more information.
   
   Iceberg manages most of its metadata in metadata files in the object storage. In some cases, it uses a metastore to manage a certain amount of metadata.





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "a2l007 (via GitHub)" <gi...@apache.org>.
a2l007 commented on PR #14329:
URL: https://github.com/apache/druid/pull/14329#issuecomment-1641052315

   @abhishekagarwal87 Sure, raised #14608




[GitHub] [druid] a2l007 commented on a diff in pull request #14329: Extension to read and ingest iceberg data files

Posted by "a2l007 (via GitHub)" <gi...@apache.org>.
a2l007 commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1225897215


##########
extensions-contrib/druid-iceberg-extensions/pom.xml:
##########
@@ -0,0 +1,314 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+<project xmlns="http://maven.apache.org/POM/4.0.0"
+         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
+         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
+
+  <groupId>org.apache.druid.extensions</groupId>
+  <artifactId>druid-iceberg-extensions</artifactId>
+  <name>druid-iceberg-extensions</name>
+  <description>druid-iceberg-extensions</description>
+
+
+  <parent>
+    <artifactId>druid</artifactId>
+    <groupId>org.apache.druid</groupId>
+    <version>27.0.0-SNAPSHOT</version>
+    <relativePath>../../pom.xml</relativePath>
+  </parent>
+  <modelVersion>4.0.0</modelVersion>
+
+  <properties>
+  </properties>
+  <dependencies>
+    <dependency>

Review Comment:
   The limitations section of the extension docs specifies that Hadoop 2.x support is not tested. Do we still need a hadoop2 profile?



##########
processing/src/main/java/org/apache/druid/data/input/AbstractInputSourceAdapter.java:
##########
@@ -0,0 +1,130 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.data.input;
+
+import com.fasterxml.jackson.annotation.JsonSubTypes;
+import com.fasterxml.jackson.annotation.JsonTypeInfo;
+import org.apache.druid.data.input.impl.LocalInputSourceAdapter;
+import org.apache.druid.data.input.impl.SplittableInputSource;
+import org.apache.druid.java.util.common.CloseableIterators;
+import org.apache.druid.java.util.common.ISE;
+import org.apache.druid.java.util.common.parsers.CloseableIterator;
+
+import javax.annotation.Nullable;
+import java.io.File;
+import java.io.IOException;
+import java.util.Collections;
+import java.util.List;
+import java.util.stream.Stream;
+
+/**
+ * A wrapper on top of {@link SplittableInputSource} that handles input source creation.

Review Comment:
   Added more details here.



##########
extensions-contrib/druid-iceberg-extensions/src/main/java/org/apache/druid/iceberg/input/IcebergInputSource.java:
##########
@@ -0,0 +1,169 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.iceberg.input;
+
+import com.fasterxml.jackson.annotation.JsonCreator;
+import com.fasterxml.jackson.annotation.JsonProperty;
+import com.google.common.base.Preconditions;
+import org.apache.druid.data.input.AbstractInputSourceAdapter;
+import org.apache.druid.data.input.InputFormat;
+import org.apache.druid.data.input.InputRowSchema;
+import org.apache.druid.data.input.InputSource;
+import org.apache.druid.data.input.InputSourceReader;
+import org.apache.druid.data.input.InputSplit;
+import org.apache.druid.data.input.SplitHintSpec;
+import org.apache.druid.data.input.impl.SplittableInputSource;
+import org.apache.druid.iceberg.filter.IcebergFilter;
+
+import javax.annotation.Nullable;
+import java.io.File;
+import java.io.IOException;
+import java.util.List;
+import java.util.stream.Stream;
+
+/**
+ * Inputsource to ingest data managed by the Iceberg table format.
+ * This inputsource talks to the configured catalog, executes any configured filters and retrieves the data file paths upto the latest snapshot associated with the iceberg table.
+ * The data file paths are then provided to a native {@link SplittableInputSource} implementation depending on the warehouse source defined.
+ */
+public class IcebergInputSource implements SplittableInputSource<List<String>>
+{
+  public static final String TYPE_KEY = "iceberg";
+
+  @JsonProperty
+  private final String tableName;
+
+  @JsonProperty
+  private final String namespace;
+
+  @JsonProperty
+  private IcebergCatalog icebergCatalog;
+
+  @JsonProperty
+  private IcebergFilter icebergFilter;
+
+  @JsonProperty
+  private AbstractInputSourceAdapter warehouseSource;
+
+  private boolean isLoaded = false;
+
+  @JsonCreator
+  public IcebergInputSource(
+      @JsonProperty("tableName") String tableName,
+      @JsonProperty("namespace") String namespace,
+      @JsonProperty("icebergFilter") @Nullable IcebergFilter icebergFilter,
+      @JsonProperty("icebergCatalog") IcebergCatalog icebergCatalog,
+      @JsonProperty("warehouseSource") AbstractInputSourceAdapter warehouseSource

Review Comment:
   This can be any of the three available `AbstractInputSourceAdapter` implementations: `local`, `s3`, or `hdfs`.
   If any other identifier is provided here, deserialization will fail if it isn't available in the class registries.
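   
   For background, a minimal hedged sketch of the Jackson polymorphic typing involved; the annotation values and subtype registration below are assumptions for illustration, not the PR's exact code:
   
   ```java
   import com.fasterxml.jackson.annotation.JsonSubTypes;
   import com.fasterxml.jackson.annotation.JsonTypeInfo;
   import org.apache.druid.data.input.impl.LocalInputSourceAdapter;

   // Illustrative only: the real annotations and subtype registrations live in the PR.
   @JsonTypeInfo(use = JsonTypeInfo.Id.NAME, property = "type")
   @JsonSubTypes({
       @JsonSubTypes.Type(name = "local", value = LocalInputSourceAdapter.class)
       // The "s3" and "hdfs" adapters are registered by their respective extension modules.
   })
   public abstract class AbstractInputSourceAdapter
   {
     // Jackson resolves the "type" field against the registered names; an unknown identifier
     // fails deserialization with an InvalidTypeIdException.
   }
   ```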





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260342485


##########
docs/development/extensions-contrib/iceberg.md:
##########
@@ -0,0 +1,121 @@
+---
+id: iceberg 
+title: "Iceberg"
+---
+
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+
+## Iceberg Ingest Extension

Review Comment:
   I think you can omit this heading. If you delete this heading, you can move the other headings up a level.
   
   Since the topic is about Iceberg ingestion, consider introducing the feature first and then talking about the extension as a means of enabling the feature. For example:
   
   Apache Iceberg is an open table format for huge analytic datasets. [Iceberg input source](../../ingestion/input-sources.md#iceberg-input-source) lets you ingest data stored in the Iceberg table format into Apache Druid. To enable the Iceberg input source, add the `druid-iceberg-extensions` extension to the list of extensions. See [Loading extensions](../../configuration/extensions.md#loading-extensions) for more information.
   
   Iceberg manages most of its metadata in metadata files in the object storage. In some cases, it uses a metastore to manage a certain amount of metadata.





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260518230


##########
docs/ingestion/input-sources.md:
##########
@@ -794,6 +794,194 @@ The following is an example of a Combining input source spec:
 ...
 ```
 
+## Iceberg input source
+
+> You need to include the `druid-iceberg-extensions` as an extension to use the Iceberg input source.
+
+The Iceberg input source is used to read data stored in the Iceberg table format. For a given table, this input source scans up to the latest iceberg snapshot from the configured Hive catalog and the underlying live data files will be ingested using the existing input source formats available in Druid.
+
+The Iceberg input source cannot be independent as it relies on the existing input sources to perform the actual read from the Data files.
+For example, if the warehouse associated with an iceberg catalog is on `S3`, please ensure that the [`druid-s3-extensions`](../development/extensions-core/s3.md) extension is also loaded.
+
+Sample specs:
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "iceberg",
+        "tableName": "iceberg_table",
+        "namespace": "iceberg_namespace",
+        "icebergCatalog": {
+            "type": "hive",
+            "warehousePath": "hdfs://warehouse/path",
+            "catalogUri": "thrift://hive-metastore.x.com:8970",
+            "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "catalog_test",
+                "hadoop.security.authentication": "kerberos",
+                "hadoop.security.authorization": "true"
+            }
+        },
+        "icebergFilter": {
+            "type": "interval",
+            "filterColumn": "event_time",
+            "intervals": [
+              "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+            ]
+        },
+        "warehouseSource": {
+            "type": "hdfs"
+        }
+      },
+      "inputFormat": {
+        "type": "parquet"
+      }
+  },
+      ...
+},
+...
+```
+
+```json
+...
+        "ioConfig": {
+          "type": "index_parallel",
+          "inputSource": {
+            "type": "iceberg",
+            "tableName": "iceberg_table",
+            "namespace": "iceberg_namespace",
+            "icebergCatalog": {
+              "type": "hive",
+              "warehousePath": "hdfs://warehouse/path",
+              "catalogUri": "thrift://hive-metastore.x.com:8970",
+              "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "default_catalog",
+                "fs.s3a.access.key" : "S3_ACCESS_KEY",
+                "fs.s3a.secret.key" : "S3_SECRET_KEY",
+                "fs.s3a.endpoint" : "S3_API_ENDPOINT"
+              }
+            },
+            "icebergFilter": {
+              "type": "interval",
+              "filterColumn": "event_time",
+              "intervals": [
+                "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+              ]
+            },
+            "warehouseSource": {
+              "type": "s3",
+              "endpointConfig": {
+                "url": "teststore.aws.com",
+                "signingRegion": "us-west-2a"
+              },
+              "clientConfig": {
+                "protocol": "http",
+                "disableChunkedEncoding": true,
+                "enablePathStyleAccess": true,
+                "forceGlobalBucketAccessEnabled": false
+              },
+              "properties": {
+                "accessKeyId": {
+                  "type": "default",
+                  "password": "foo"
+                },
+                "secretAccessKey": {
+                  "type": "default",
+                  "password": "bar"
+                }
+              },
+            }
+          },
+          "inputFormat": {
+            "type": "parquet"
+          }
+        },
+...
+},
+```
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set the value to `iceberg`.|yes|
+|tableName|The iceberg table name configured in the catalog.|yes|
+|namespace|The iceberg namespace associated with the table|yes|
+|icebergFilter|JSON Object used to filter data files within a snapshot when reading|no|
+|icebergCatalog|JSON Object used to define the catalog that manages the configured iceberg table|yes|
+|warehouseSource|JSON Object used to indicate which native input source needs to be used to read the data files from the warehouse|yes|
+
+Catalog Object:
+
+There are two supported catalog types: `local` and `hive`

Review Comment:
   ```suggestion
   The catalog object supports `local` and `hive` catalog types.
   ```





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260530944


##########
docs/ingestion/input-sources.md:
##########
@@ -794,6 +794,194 @@ The following is an example of a Combining input source spec:
 ...
 ```
 
+## Iceberg input source
+
+> You need to include the `druid-iceberg-extensions` as an extension to use the Iceberg input source.
+
+The Iceberg input source is used to read data stored in the Iceberg table format. For a given table, this input source scans up to the latest iceberg snapshot from the configured Hive catalog and the underlying live data files will be ingested using the existing input source formats available in Druid.
+
+The Iceberg input source cannot be independent as it relies on the existing input sources to perform the actual read from the Data files.
+For example, if the warehouse associated with an iceberg catalog is on `S3`, please ensure that the [`druid-s3-extensions`](../development/extensions-core/s3.md) extension is also loaded.
+
+Sample specs:
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "iceberg",
+        "tableName": "iceberg_table",
+        "namespace": "iceberg_namespace",
+        "icebergCatalog": {
+            "type": "hive",
+            "warehousePath": "hdfs://warehouse/path",
+            "catalogUri": "thrift://hive-metastore.x.com:8970",
+            "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "catalog_test",
+                "hadoop.security.authentication": "kerberos",
+                "hadoop.security.authorization": "true"
+            }
+        },
+        "icebergFilter": {
+            "type": "interval",
+            "filterColumn": "event_time",
+            "intervals": [
+              "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+            ]
+        },
+        "warehouseSource": {
+            "type": "hdfs"
+        }
+      },
+      "inputFormat": {
+        "type": "parquet"
+      }
+  },
+      ...
+},
+...
+```
+
+```json
+...
+        "ioConfig": {
+          "type": "index_parallel",
+          "inputSource": {
+            "type": "iceberg",
+            "tableName": "iceberg_table",
+            "namespace": "iceberg_namespace",
+            "icebergCatalog": {
+              "type": "hive",
+              "warehousePath": "hdfs://warehouse/path",
+              "catalogUri": "thrift://hive-metastore.x.com:8970",
+              "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "default_catalog",
+                "fs.s3a.access.key" : "S3_ACCESS_KEY",
+                "fs.s3a.secret.key" : "S3_SECRET_KEY",
+                "fs.s3a.endpoint" : "S3_API_ENDPOINT"
+              }
+            },
+            "icebergFilter": {
+              "type": "interval",
+              "filterColumn": "event_time",
+              "intervals": [
+                "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+              ]
+            },
+            "warehouseSource": {
+              "type": "s3",
+              "endpointConfig": {
+                "url": "teststore.aws.com",
+                "signingRegion": "us-west-2a"
+              },
+              "clientConfig": {
+                "protocol": "http",
+                "disableChunkedEncoding": true,
+                "enablePathStyleAccess": true,
+                "forceGlobalBucketAccessEnabled": false
+              },
+              "properties": {
+                "accessKeyId": {
+                  "type": "default",
+                  "password": "foo"
+                },
+                "secretAccessKey": {
+                  "type": "default",
+                  "password": "bar"
+                }
+              },
+            }
+          },
+          "inputFormat": {
+            "type": "parquet"
+          }
+        },
+...
+},
+```
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set the value to `iceberg`.|yes|
+|tableName|The iceberg table name configured in the catalog.|yes|
+|namespace|The iceberg namespace associated with the table|yes|
+|icebergFilter|JSON Object used to filter data files within a snapshot when reading|no|
+|icebergCatalog|JSON Object used to define the catalog that manages the configured iceberg table|yes|
+|warehouseSource|JSON Object used to indicate which native input source needs to be used to read the data files from the warehouse|yes|
+
+Catalog Object:
+
+There are two supported catalog types: `local` and `hive`
+
+Local catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `local`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+Hive Catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `hive`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogUri|The URI associated with the hive catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+IcebergFilter Object:
+
+This input source provides filters: `and` , `equals` , `interval` and `or`. These filters can be used to filter out data files from a snapshot, thereby reducing the number of files Druid has to ingest.
+
+`equals` Filter:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `equals`.|yes|
+|filterColumn|The column name from the iceberg table schema based on which filtering needs to happen|yes|

Review Comment:
   ```suggestion
   |filterColumn|The name of the column from the Iceberg table schema to use for filtering.|yes|
   ```



##########
docs/ingestion/input-sources.md:
##########
@@ -794,6 +794,194 @@ The following is an example of a Combining input source spec:
 ...
 ```
 
+## Iceberg input source
+
+> You need to include the `druid-iceberg-extensions` as an extension to use the Iceberg input source.
+
+The Iceberg input source is used to read data stored in the Iceberg table format. For a given table, this input source scans up to the latest iceberg snapshot from the configured Hive catalog and the underlying live data files will be ingested using the existing input source formats available in Druid.
+
+The Iceberg input source cannot be independent as it relies on the existing input sources to perform the actual read from the Data files.
+For example, if the warehouse associated with an iceberg catalog is on `S3`, please ensure that the [`druid-s3-extensions`](../development/extensions-core/s3.md) extension is also loaded.
+
+Sample specs:
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "iceberg",
+        "tableName": "iceberg_table",
+        "namespace": "iceberg_namespace",
+        "icebergCatalog": {
+            "type": "hive",
+            "warehousePath": "hdfs://warehouse/path",
+            "catalogUri": "thrift://hive-metastore.x.com:8970",
+            "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "catalog_test",
+                "hadoop.security.authentication": "kerberos",
+                "hadoop.security.authorization": "true"
+            }
+        },
+        "icebergFilter": {
+            "type": "interval",
+            "filterColumn": "event_time",
+            "intervals": [
+              "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+            ]
+        },
+        "warehouseSource": {
+            "type": "hdfs"
+        }
+      },
+      "inputFormat": {
+        "type": "parquet"
+      }
+  },
+      ...
+},
+...
+```
+
+```json
+...
+        "ioConfig": {
+          "type": "index_parallel",
+          "inputSource": {
+            "type": "iceberg",
+            "tableName": "iceberg_table",
+            "namespace": "iceberg_namespace",
+            "icebergCatalog": {
+              "type": "hive",
+              "warehousePath": "hdfs://warehouse/path",
+              "catalogUri": "thrift://hive-metastore.x.com:8970",
+              "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "default_catalog",
+                "fs.s3a.access.key" : "S3_ACCESS_KEY",
+                "fs.s3a.secret.key" : "S3_SECRET_KEY",
+                "fs.s3a.endpoint" : "S3_API_ENDPOINT"
+              }
+            },
+            "icebergFilter": {
+              "type": "interval",
+              "filterColumn": "event_time",
+              "intervals": [
+                "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+              ]
+            },
+            "warehouseSource": {
+              "type": "s3",
+              "endpointConfig": {
+                "url": "teststore.aws.com",
+                "signingRegion": "us-west-2a"
+              },
+              "clientConfig": {
+                "protocol": "http",
+                "disableChunkedEncoding": true,
+                "enablePathStyleAccess": true,
+                "forceGlobalBucketAccessEnabled": false
+              },
+              "properties": {
+                "accessKeyId": {
+                  "type": "default",
+                  "password": "foo"
+                },
+                "secretAccessKey": {
+                  "type": "default",
+                  "password": "bar"
+                }
+              },
+            }
+          },
+          "inputFormat": {
+            "type": "parquet"
+          }
+        },
+...
+},
+```
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set the value to `iceberg`.|yes|
+|tableName|The iceberg table name configured in the catalog.|yes|
+|namespace|The iceberg namespace associated with the table|yes|
+|icebergFilter|JSON Object used to filter data files within a snapshot when reading|no|
+|icebergCatalog|JSON Object used to define the catalog that manages the configured iceberg table|yes|
+|warehouseSource|JSON Object used to indicate which native input source needs to be used to read the data files from the warehouse|yes|
+
+Catalog Object:
+
+There are two supported catalog types: `local` and `hive`
+
+Local catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `local`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+Hive Catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `hive`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogUri|The URI associated with the hive catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+IcebergFilter Object:
+
+This input source provides filters: `and` , `equals` , `interval` and `or`. These filters can be used to filter out data files from a snapshot, thereby reducing the number of files Druid has to ingest.
+
+`equals` Filter:

Review Comment:
   ```suggestion
   The `equals` filter:
   ```





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260539096


##########
docs/ingestion/input-sources.md:
##########
@@ -794,6 +794,194 @@ The following is an example of a Combining input source spec:
 ...
 ```
 
+## Iceberg input source
+
+> You need to include the `druid-iceberg-extensions` as an extension to use the Iceberg input source.
+
+The Iceberg input source is used to read data stored in the Iceberg table format. For a given table, this input source scans up to the latest iceberg snapshot from the configured Hive catalog and the underlying live data files will be ingested using the existing input source formats available in Druid.
+
+The Iceberg input source cannot be independent as it relies on the existing input sources to perform the actual read from the Data files.
+For example, if the warehouse associated with an iceberg catalog is on `S3`, please ensure that the [`druid-s3-extensions`](../development/extensions-core/s3.md) extension is also loaded.
+
+Sample specs:
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "iceberg",
+        "tableName": "iceberg_table",
+        "namespace": "iceberg_namespace",
+        "icebergCatalog": {
+            "type": "hive",
+            "warehousePath": "hdfs://warehouse/path",
+            "catalogUri": "thrift://hive-metastore.x.com:8970",
+            "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "catalog_test",
+                "hadoop.security.authentication": "kerberos",
+                "hadoop.security.authorization": "true"
+            }
+        },
+        "icebergFilter": {
+            "type": "interval",
+            "filterColumn": "event_time",
+            "intervals": [
+              "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+            ]
+        },
+        "warehouseSource": {
+            "type": "hdfs"
+        }
+      },
+      "inputFormat": {
+        "type": "parquet"
+      }
+  },
+      ...
+},
+...
+```
+
+```json
+...
+        "ioConfig": {
+          "type": "index_parallel",
+          "inputSource": {
+            "type": "iceberg",
+            "tableName": "iceberg_table",
+            "namespace": "iceberg_namespace",
+            "icebergCatalog": {
+              "type": "hive",
+              "warehousePath": "hdfs://warehouse/path",
+              "catalogUri": "thrift://hive-metastore.x.com:8970",
+              "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "default_catalog",
+                "fs.s3a.access.key" : "S3_ACCESS_KEY",
+                "fs.s3a.secret.key" : "S3_SECRET_KEY",
+                "fs.s3a.endpoint" : "S3_API_ENDPOINT"
+              }
+            },
+            "icebergFilter": {
+              "type": "interval",
+              "filterColumn": "event_time",
+              "intervals": [
+                "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+              ]
+            },
+            "warehouseSource": {
+              "type": "s3",
+              "endpointConfig": {
+                "url": "teststore.aws.com",
+                "signingRegion": "us-west-2a"
+              },
+              "clientConfig": {
+                "protocol": "http",
+                "disableChunkedEncoding": true,
+                "enablePathStyleAccess": true,
+                "forceGlobalBucketAccessEnabled": false
+              },
+              "properties": {
+                "accessKeyId": {
+                  "type": "default",
+                  "password": "foo"
+                },
+                "secretAccessKey": {
+                  "type": "default",
+                  "password": "bar"
+                }
+              },
+            }
+          },
+          "inputFormat": {
+            "type": "parquet"
+          }
+        },
+...
+},
+```
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set the value to `iceberg`.|yes|
+|tableName|The iceberg table name configured in the catalog.|yes|
+|namespace|The iceberg namespace associated with the table|yes|
+|icebergFilter|JSON Object used to filter data files within a snapshot when reading|no|
+|icebergCatalog|JSON Object used to define the catalog that manages the configured iceberg table|yes|
+|warehouseSource|JSON Object used to indicate which native input source needs to be used to read the data files from the warehouse|yes|
+
+Catalog Object:
+
+There are two supported catalog types: `local` and `hive`
+
+Local catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `local`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+Hive Catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `hive`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogUri|The URI associated with the hive catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+IcebergFilter Object:
+
+This input source provides filters: `and` , `equals` , `interval` and `or`. These filters can be used to filter out data files from a snapshot, thereby reducing the number of files Druid has to ingest.
+
+`equals` Filter:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `equals`.|yes|
+|filterColumn|The column name from the iceberg table schema based on which filtering needs to happen|yes|
+|filterValue|The value to filter on|yes|
+
+`interval` Filter:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `interval`.|yes|
+|filterColumn|The column name from the iceberg table schema based on which filtering needs to happen|yes|
+|intervals|A JSON array containing ISO-8601 interval strings. This defines the time ranges to filter on. The start interval is inclusive and the end interval is exclusive. |yes|
+
+`and` Filter:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `and`.|yes|
+|filters|List of iceberg filters that needs to be AND-ed|yes|
+
+`or` Filter:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `or`.|yes|

Review Comment:
   ```suggestion
   |`type`|Set this value to `or`.|Yes|
   ```





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260535905


##########
docs/ingestion/input-sources.md:
##########
@@ -794,6 +794,194 @@ The following is an example of a Combining input source spec:
 ...
 ```
 
+## Iceberg input source
+
+> You need to include the `druid-iceberg-extensions` as an extension to use the Iceberg input source.
+
+The Iceberg input source is used to read data stored in the Iceberg table format. For a given table, this input source scans up to the latest iceberg snapshot from the configured Hive catalog and the underlying live data files will be ingested using the existing input source formats available in Druid.
+
+The Iceberg input source cannot be independent as it relies on the existing input sources to perform the actual read from the Data files.
+For example, if the warehouse associated with an iceberg catalog is on `S3`, please ensure that the [`druid-s3-extensions`](../development/extensions-core/s3.md) extension is also loaded.
+
+Sample specs:
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "iceberg",
+        "tableName": "iceberg_table",
+        "namespace": "iceberg_namespace",
+        "icebergCatalog": {
+            "type": "hive",
+            "warehousePath": "hdfs://warehouse/path",
+            "catalogUri": "thrift://hive-metastore.x.com:8970",
+            "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "catalog_test",
+                "hadoop.security.authentication": "kerberos",
+                "hadoop.security.authorization": "true"
+            }
+        },
+        "icebergFilter": {
+            "type": "interval",
+            "filterColumn": "event_time",
+            "intervals": [
+              "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+            ]
+        },
+        "warehouseSource": {
+            "type": "hdfs"
+        }
+      },
+      "inputFormat": {
+        "type": "parquet"
+      }
+  },
+      ...
+},
+...
+```
+
+```json
+...
+        "ioConfig": {
+          "type": "index_parallel",
+          "inputSource": {
+            "type": "iceberg",
+            "tableName": "iceberg_table",
+            "namespace": "iceberg_namespace",
+            "icebergCatalog": {
+              "type": "hive",
+              "warehousePath": "hdfs://warehouse/path",
+              "catalogUri": "thrift://hive-metastore.x.com:8970",
+              "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "default_catalog",
+                "fs.s3a.access.key" : "S3_ACCESS_KEY",
+                "fs.s3a.secret.key" : "S3_SECRET_KEY",
+                "fs.s3a.endpoint" : "S3_API_ENDPOINT"
+              }
+            },
+            "icebergFilter": {
+              "type": "interval",
+              "filterColumn": "event_time",
+              "intervals": [
+                "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+              ]
+            },
+            "warehouseSource": {
+              "type": "s3",
+              "endpointConfig": {
+                "url": "teststore.aws.com",
+                "signingRegion": "us-west-2a"
+              },
+              "clientConfig": {
+                "protocol": "http",
+                "disableChunkedEncoding": true,
+                "enablePathStyleAccess": true,
+                "forceGlobalBucketAccessEnabled": false
+              },
+              "properties": {
+                "accessKeyId": {
+                  "type": "default",
+                  "password": "foo"
+                },
+                "secretAccessKey": {
+                  "type": "default",
+                  "password": "bar"
+                }
+              }
+            }
+          },
+          "inputFormat": {
+            "type": "parquet"
+          }
+        },
+...
+},
+```
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set the value to `iceberg`.|yes|
+|tableName|The iceberg table name configured in the catalog.|yes|
+|namespace|The iceberg namespace associated with the table|yes|
+|icebergFilter|JSON Object used to filter data files within a snapshot when reading|no|
+|icebergCatalog|JSON Object used to define the catalog that manages the configured iceberg table|yes|
+|warehouseSource|JSON Object used to indicate which native input source needs to be used to read the data files from the warehouse|yes|
+
+Catalog Object:
+
+There are two supported catalog types: `local` and `hive`
+
+Local catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `local`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+Hive Catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `hive`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogUri|The URI associated with the hive catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+IcebergFilter Object:
+
+This input source provides filters: `and` , `equals` , `interval` and `or`. These filters can be used to filter out data files from a snapshot, thereby reducing the number of files Druid has to ingest.
+
+`equals` Filter:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `equals`.|yes|
+|filterColumn|The column name from the iceberg table schema based on which filtering needs to happen|yes|
+|filterValue|The value to filter on|yes|
+
+`interval` Filter:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `interval`.|yes|
+|filterColumn|The column name from the iceberg table schema based on which filtering needs to happen|yes|
+|intervals|A JSON array containing ISO-8601 interval strings. This defines the time ranges to filter on. The start interval is inclusive and the end interval is exclusive. |yes|
+
+`and` Filter:
+
+|Property|Description|Required|

Review Comment:
   ```suggestion
   |**Property**|**Description**|**Required**|
   ```
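
A compound filter built from the filter tables quoted above might look like the sketch below. Note that the `and` filter's property table is cut off in this hunk, so the name of its child-filter property (`filters`) is an assumption, and `region`/`us-east-1` are placeholder column and value names rather than anything taken from the PR text.

```json
"icebergFilter": {
  "type": "and",
  "filters": [
    {
      "type": "interval",
      "filterColumn": "event_time",
      "intervals": ["2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"]
    },
    {
      "type": "equals",
      "filterColumn": "region",
      "filterValue": "us-east-1"
    }
  ]
}
```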





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260349966


##########
docs/development/extensions-contrib/iceberg.md:
##########
@@ -0,0 +1,121 @@
+---
+id: iceberg 
+title: "Iceberg"
+---
+
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+
+## Iceberg Ingest Extension
+
+This extension provides [IcebergInputSource](../../ingestion/input-sources.md#iceberg-input-source) which enables ingestion of data stored in the Iceberg table format into Druid.
+
+Apache Iceberg is an open table format for huge analytic datasets. Even though iceberg manages most of its metadata in metadata files in the object storage, it is still dependent on a metastore for managing a certain amount of metadata.
+These metastores are defined as Iceberg catalogs and this extension supports connecting to the following catalog types:
+* Hive metastore catalog
+* Local catalog
+
+Support for AWS Glue and REST-based catalogs is not available yet.
+
+For a given catalog, iceberg table name and filters, the IcebergInputSource works by reading the table from the catalog, applying the filters and extracting all the underlying live data files up to the latest snapshot.
+The data files are in either Parquet, ORC or Avro formats and all of these have InputFormat support in Druid. The data files typically reside in a warehouse location which could be in HDFS, S3 or the local filesystem.
+This extension relies on the existing InputSource connectors in Druid to read the data files from the warehouse. Therefore, the IcebergInputSource can be considered as an intermediate InputSource which provides the file paths for other InputSource implementations.

Review Comment:
   ```suggestion
   The `druid-iceberg-extensions` extension relies on the existing input source connectors in Druid to read the data files from the warehouse. Therefore, the Iceberg input source can be considered as an intermediate input source, which provides the file paths for other input source implementations.
   ```





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260486024


##########
docs/development/extensions-contrib/iceberg.md:
##########
@@ -0,0 +1,121 @@
+---
+id: iceberg 
+title: "Iceberg"
+---
+
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+
+## Iceberg Ingest Extension
+
+This extension provides [IcebergInputSource](../../ingestion/input-sources.md#iceberg-input-source) which enables ingestion of data stored in the Iceberg table format into Druid.
+
+Apache Iceberg is an open table format for huge analytic datasets. Even though iceberg manages most of its metadata in metadata files in the object storage, it is still dependent on a metastore for managing a certain amount of metadata.
+These metastores are defined as Iceberg catalogs and this extension supports connecting to the following catalog types:
+* Hive metastore catalog
+* Local catalog
+
+Support for AWS Glue and REST-based catalogs is not available yet.
+
+For a given catalog, iceberg table name and filters, the IcebergInputSource works by reading the table from the catalog, applying the filters and extracting all the underlying live data files up to the latest snapshot.
+The data files are in either Parquet, ORC or Avro formats and all of these have InputFormat support in Druid. The data files typically reside in a warehouse location which could be in HDFS, S3 or the local filesystem.
+This extension relies on the existing InputSource connectors in Druid to read the data files from the warehouse. Therefore, the IcebergInputSource can be considered as an intermediate InputSource which provides the file paths for other InputSource implementations.
+
+### Load the Iceberg Ingest extension
+
+To use the iceberg extension, add the `druid-iceberg-extensions` to the list of loaded extensions. See [Loading extensions](../../configuration/extensions.md#loading-extensions) for more information.
+
+
+### Hive Metastore catalog
+
+For Druid to seamlessly talk to the Hive Metastore, ensure that the Hive specific configuration files such as `hive-site.xml` and `core-site.xml` are available in the Druid classpath for peon processes.  
+Hive specific properties can also be specified under the `catalogProperties` object in the ingestion spec. 
+
+Hive metastore catalogs can be associated with different types of warehouses, but this extension presently only supports HDFS and S3 warehouse directories.
+
+#### Reading from HDFS warehouse 
+
+Ensure that the extension `druid-hdfs-storage` is loaded. The data file paths are extracted from the Hive metastore catalog and [HDFS input source](../../ingestion/input-sources.md#hdfs-input-source) is used to ingest these files.
+The `warehouseSource` type in the ingestion spec should be `hdfs`.
+
+If the Hive metastore supports Kerberos authentication, the following properties will be required in the `catalogProperties`:
+
+```json
+"catalogProperties": {
+  "principal": "krb_principal",
+  "keytab": "/path/to/keytab"
+}
+```
+Only Kerberos based authentication is supported as of now.
+
+#### Reading from S3 warehouse
+
+Ensure that the extension `druid-s3-extensions` is loaded. The data file paths are extracted from the Hive metastore catalog and `S3InputSource` is used to ingest these files.
+The `warehouseSource` type in the ingestion spec should be `s3`. If the S3 endpoint for the warehouse is different from the endpoint configured as the deep storage, the following properties are required in the `warehouseSource` section to define the S3 endpoint settings:
+
+```json
+"warehouseSource": {
+  "type": "s3",
+  "endpointConfig": {
+    "url": "S3_ENDPOINT_URL",
+    "signingRegion": "us-east-1"
+  },
+  "clientConfig": {
+    "protocol": "http",
+    "disableChunkedEncoding": true,
+    "enablePathStyleAccess": true,
+    "forceGlobalBucketAccessEnabled": false
+  },
+  "properties": {
+    "accessKeyId": {
+      "type": "default",
+      "password": "<ACCESS_KEY_ID>"
+    },
+    "secretAccessKey": {
+      "type": "default",
+      "password": "<SECRET_ACCESS_KEY>"
+    }
+  }
+}
+```
+
+This extension uses the [Hadoop AWS module](https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/) to connect to S3 and retrieve the metadata and data file paths.
+The following properties will be required in the `catalogProperties`:

Review Comment:
   ```suggestion
   The following properties are required in the `catalogProperties`:
   ```
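
For readers following along: the hunk is truncated right before the snippet that this sentence introduces. The `catalogProperties` block in question, as quoted in full in a later comment on this PR, is:

```json
"catalogProperties": {
  "fs.s3a.access.key" : "S3_ACCESS_KEY",
  "fs.s3a.secret.key" : "S3_SECRET_KEY",
  "fs.s3a.endpoint" : "S3_API_ENDPOINT"
}
```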





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260485062


##########
docs/development/extensions-contrib/iceberg.md:
##########
@@ -0,0 +1,121 @@
+---
+id: iceberg 
+title: "Iceberg"
+---
+
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+
+## Iceberg Ingest Extension
+
+This extension provides [IcebergInputSource](../../ingestion/input-sources.md#iceberg-input-source) which enables ingestion of data stored in the Iceberg table format into Druid.
+
+Apache Iceberg is an open table format for huge analytic datasets. Even though iceberg manages most of its metadata in metadata files in the object storage, it is still dependent on a metastore for managing a certain amount of metadata.
+These metastores are defined as Iceberg catalogs and this extension supports connecting to the following catalog types:
+* Hive metastore catalog
+* Local catalog
+
+Support for AWS Glue and REST-based catalogs is not available yet.
+
+For a given catalog, iceberg table name and filters, the IcebergInputSource works by reading the table from the catalog, applying the filters and extracting all the underlying live data files up to the latest snapshot.
+The data files are in either Parquet, ORC or Avro formats and all of these have InputFormat support in Druid. The data files typically reside in a warehouse location which could be in HDFS, S3 or the local filesystem.
+This extension relies on the existing InputSource connectors in Druid to read the data files from the warehouse. Therefore, the IcebergInputSource can be considered as an intermediate InputSource which provides the file paths for other InputSource implementations.
+
+### Load the Iceberg Ingest extension
+
+To use the iceberg extension, add the `druid-iceberg-extensions` to the list of loaded extensions. See [Loading extensions](../../configuration/extensions.md#loading-extensions) for more information.
+
+
+### Hive Metastore catalog
+
+For Druid to seamlessly talk to the Hive Metastore, ensure that the Hive specific configuration files such as `hive-site.xml` and `core-site.xml` are available in the Druid classpath for peon processes.  
+Hive specific properties can also be specified under the `catalogProperties` object in the ingestion spec. 
+
+Hive metastore catalogs can be associated with different types of warehouses, but this extension presently only supports HDFS and S3 warehouse directories.
+
+#### Reading from HDFS warehouse 
+
+Ensure that the extension `druid-hdfs-storage` is loaded. The data file paths are extracted from the Hive metastore catalog and [HDFS input source](../../ingestion/input-sources.md#hdfs-input-source) is used to ingest these files.
+The `warehouseSource` type in the ingestion spec should be `hdfs`.
+
+If the Hive metastore supports Kerberos authentication, the following properties will be required in the `catalogProperties`:
+
+```json
+"catalogProperties": {
+  "principal": "krb_principal",
+  "keytab": "/path/to/keytab"
+}
+```
+Only Kerberos based authentication is supported as of now.
+
+#### Reading from S3 warehouse
+
+Ensure that the extension `druid-s3-extensions` is loaded. The data file paths are extracted from the Hive metastore catalog and `S3InputSource` is used to ingest these files.
+The `warehouseSource` type in the ingestion spec should be `s3`. If the S3 endpoint for the warehouse is different from the endpoint configured as the deep storage, the following properties are required in the `warehouseSource` section to define the S3 endpoint settings:

Review Comment:
   ```suggestion
   Set the `type` property of the `warehouseSource` object to `s3` in the ingestion spec. If the S3 endpoint for the warehouse is different from the endpoint configured as the deep storage, include the following properties in the `warehouseSource` object to define the S3 endpoint settings:
   ```





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260502738


##########
docs/ingestion/input-sources.md:
##########
@@ -794,6 +794,194 @@ The following is an example of a Combining input source spec:
 ...
 ```
 
+## Iceberg input source
+
+> You need to include the `druid-iceberg-extensions` as an extension to use the Iceberg input source.
+
+The Iceberg input source is used to read data stored in the Iceberg table format. For a given table, this input source scans up to the latest iceberg snapshot from the configured Hive catalog and the underlying live data files will be ingested using the existing input source formats available in Druid.
+
+The Iceberg input source cannot be independent as it relies on the existing input sources to perform the actual read from the Data files.
+For example, if the warehouse associated with an iceberg catalog is on `S3`, please ensure that the [`druid-s3-extensions`](../development/extensions-core/s3.md) extension is also loaded.
+
+Sample specs:
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "iceberg",
+        "tableName": "iceberg_table",
+        "namespace": "iceberg_namespace",
+        "icebergCatalog": {
+            "type": "hive",
+            "warehousePath": "hdfs://warehouse/path",
+            "catalogUri": "thrift://hive-metastore.x.com:8970",
+            "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "catalog_test",
+                "hadoop.security.authentication": "kerberos",
+                "hadoop.security.authorization": "true"
+            }
+        },
+        "icebergFilter": {
+            "type": "interval",
+            "filterColumn": "event_time",
+            "intervals": [
+              "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+            ]
+        },
+        "warehouseSource": {
+            "type": "hdfs"
+        }
+      },
+      "inputFormat": {
+        "type": "parquet"
+      }
+  },
+      ...
+},
+...
+```
+
+```json
+...
+        "ioConfig": {
+          "type": "index_parallel",
+          "inputSource": {
+            "type": "iceberg",
+            "tableName": "iceberg_table",
+            "namespace": "iceberg_namespace",
+            "icebergCatalog": {
+              "type": "hive",
+              "warehousePath": "hdfs://warehouse/path",
+              "catalogUri": "thrift://hive-metastore.x.com:8970",
+              "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "default_catalog",
+                "fs.s3a.access.key" : "S3_ACCESS_KEY",
+                "fs.s3a.secret.key" : "S3_SECRET_KEY",
+                "fs.s3a.endpoint" : "S3_API_ENDPOINT"
+              }
+            },
+            "icebergFilter": {
+              "type": "interval",
+              "filterColumn": "event_time",
+              "intervals": [
+                "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+              ]
+            },
+            "warehouseSource": {
+              "type": "s3",
+              "endpointConfig": {
+                "url": "teststore.aws.com",
+                "signingRegion": "us-west-2a"
+              },
+              "clientConfig": {
+                "protocol": "http",
+                "disableChunkedEncoding": true,
+                "enablePathStyleAccess": true,
+                "forceGlobalBucketAccessEnabled": false
+              },
+              "properties": {
+                "accessKeyId": {
+                  "type": "default",
+                  "password": "foo"
+                },
+                "secretAccessKey": {
+                  "type": "default",
+                  "password": "bar"
+                }
+              }
+            }
+          },
+          "inputFormat": {
+            "type": "parquet"
+          }
+        },
+...
+},
+```
+
+|Property|Description|Required|

Review Comment:
   ```suggestion
   |**Property**|**Description**|**Required**|
   ```





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260503976


##########
docs/ingestion/input-sources.md:
##########
@@ -794,6 +794,194 @@ The following is an example of a Combining input source spec:
 ...
 ```
 
+## Iceberg input source
+
+> You need to include the `druid-iceberg-extensions` as an extension to use the Iceberg input source.
+
+The Iceberg input source is used to read data stored in the Iceberg table format. For a given table, this input source scans up to the latest iceberg snapshot from the configured Hive catalog and the underlying live data files will be ingested using the existing input source formats available in Druid.
+
+The Iceberg input source cannot be independent as it relies on the existing input sources to perform the actual read from the Data files.
+For example, if the warehouse associated with an iceberg catalog is on `S3`, please ensure that the [`druid-s3-extensions`](../development/extensions-core/s3.md) extension is also loaded.
+
+Sample specs:
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "iceberg",
+        "tableName": "iceberg_table",
+        "namespace": "iceberg_namespace",
+        "icebergCatalog": {
+            "type": "hive",
+            "warehousePath": "hdfs://warehouse/path",
+            "catalogUri": "thrift://hive-metastore.x.com:8970",
+            "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "catalog_test",
+                "hadoop.security.authentication": "kerberos",
+                "hadoop.security.authorization": "true"
+            }
+        },
+        "icebergFilter": {
+            "type": "interval",
+            "filterColumn": "event_time",
+            "intervals": [
+              "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+            ]
+        },
+        "warehouseSource": {
+            "type": "hdfs"
+        }
+      },
+      "inputFormat": {
+        "type": "parquet"
+      }
+  },
+      ...
+},
+...
+```
+
+```json
+...
+        "ioConfig": {
+          "type": "index_parallel",
+          "inputSource": {
+            "type": "iceberg",
+            "tableName": "iceberg_table",
+            "namespace": "iceberg_namespace",
+            "icebergCatalog": {
+              "type": "hive",
+              "warehousePath": "hdfs://warehouse/path",
+              "catalogUri": "thrift://hive-metastore.x.com:8970",
+              "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "default_catalog",
+                "fs.s3a.access.key" : "S3_ACCESS_KEY",
+                "fs.s3a.secret.key" : "S3_SECRET_KEY",
+                "fs.s3a.endpoint" : "S3_API_ENDPOINT"
+              }
+            },
+            "icebergFilter": {
+              "type": "interval",
+              "filterColumn": "event_time",
+              "intervals": [
+                "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+              ]
+            },
+            "warehouseSource": {
+              "type": "s3",
+              "endpointConfig": {
+                "url": "teststore.aws.com",
+                "signingRegion": "us-west-2a"
+              },
+              "clientConfig": {
+                "protocol": "http",
+                "disableChunkedEncoding": true,
+                "enablePathStyleAccess": true,
+                "forceGlobalBucketAccessEnabled": false
+              },
+              "properties": {
+                "accessKeyId": {
+                  "type": "default",
+                  "password": "foo"
+                },
+                "secretAccessKey": {
+                  "type": "default",
+                  "password": "bar"
+                }
+              }
+            }
+          },
+          "inputFormat": {
+            "type": "parquet"
+          }
+        },
+...
+},
+```
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set the value to `iceberg`.|yes|
+|tableName|The iceberg table name configured in the catalog.|yes|
+|namespace|The iceberg namespace associated with the table|yes|

Review Comment:
   ```suggestion
   |`namespace`|The Iceberg namespace associated with the table.|Yes|
   ```
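
As a quick reference against the required column in the table above, a spec trimmed down to only the required properties might look like the sketch below. The values reuse the placeholders from the sample specs; `icebergFilter` and `catalogProperties` are optional and omitted here.

```json
"inputSource": {
  "type": "iceberg",
  "tableName": "iceberg_table",
  "namespace": "iceberg_namespace",
  "icebergCatalog": {
    "type": "hive",
    "warehousePath": "hdfs://warehouse/path",
    "catalogUri": "thrift://hive-metastore.x.com:8970"
  },
  "warehouseSource": {
    "type": "hdfs"
  }
}
```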





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260523285


##########
docs/ingestion/input-sources.md:
##########
@@ -794,6 +794,194 @@ The following is an example of a Combining input source spec:
 ...
 ```
 
+## Iceberg input source
+
+> You need to include the `druid-iceberg-extensions` as an extension to use the Iceberg input source.
+
+The Iceberg input source is used to read data stored in the Iceberg table format. For a given table, this input source scans up to the latest iceberg snapshot from the configured Hive catalog and the underlying live data files will be ingested using the existing input source formats available in Druid.
+
+The Iceberg input source cannot be independent as it relies on the existing input sources to perform the actual read from the Data files.
+For example, if the warehouse associated with an iceberg catalog is on `S3`, please ensure that the [`druid-s3-extensions`](../development/extensions-core/s3.md) extension is also loaded.
+
+Sample specs:
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "iceberg",
+        "tableName": "iceberg_table",
+        "namespace": "iceberg_namespace",
+        "icebergCatalog": {
+            "type": "hive",
+            "warehousePath": "hdfs://warehouse/path",
+            "catalogUri": "thrift://hive-metastore.x.com:8970",
+            "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "catalog_test",
+                "hadoop.security.authentication": "kerberos",
+                "hadoop.security.authorization": "true"
+            }
+        },
+        "icebergFilter": {
+            "type": "interval",
+            "filterColumn": "event_time",
+            "intervals": [
+              "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+            ]
+        },
+        "warehouseSource": {
+            "type": "hdfs"
+        }
+      },
+      "inputFormat": {
+        "type": "parquet"
+      }
+  },
+      ...
+},
+...
+```
+
+```json
+...
+        "ioConfig": {
+          "type": "index_parallel",
+          "inputSource": {
+            "type": "iceberg",
+            "tableName": "iceberg_table",
+            "namespace": "iceberg_namespace",
+            "icebergCatalog": {
+              "type": "hive",
+              "warehousePath": "hdfs://warehouse/path",
+              "catalogUri": "thrift://hive-metastore.x.com:8970",
+              "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "default_catalog",
+                "fs.s3a.access.key" : "S3_ACCESS_KEY",
+                "fs.s3a.secret.key" : "S3_SECRET_KEY",
+                "fs.s3a.endpoint" : "S3_API_ENDPOINT"
+              }
+            },
+            "icebergFilter": {
+              "type": "interval",
+              "filterColumn": "event_time",
+              "intervals": [
+                "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+              ]
+            },
+            "warehouseSource": {
+              "type": "s3",
+              "endpointConfig": {
+                "url": "teststore.aws.com",
+                "signingRegion": "us-west-2a"
+              },
+              "clientConfig": {
+                "protocol": "http",
+                "disableChunkedEncoding": true,
+                "enablePathStyleAccess": true,
+                "forceGlobalBucketAccessEnabled": false
+              },
+              "properties": {
+                "accessKeyId": {
+                  "type": "default",
+                  "password": "foo"
+                },
+                "secretAccessKey": {
+                  "type": "default",
+                  "password": "bar"
+                }
+              }
+            }
+          },
+          "inputFormat": {
+            "type": "parquet"
+          }
+        },
+...
+},
+```
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set the value to `iceberg`.|yes|
+|tableName|The iceberg table name configured in the catalog.|yes|
+|namespace|The iceberg namespace associated with the table|yes|
+|icebergFilter|JSON Object used to filter data files within a snapshot when reading|no|
+|icebergCatalog|JSON Object used to define the catalog that manages the configured iceberg table|yes|
+|warehouseSource|JSON Object used to indicate which native input source needs to be used to read the data files from the warehouse|yes|
+
+Catalog Object:
+
+There are two supported catalog types: `local` and `hive`
+
+Local catalog:

Review Comment:
   ```suggestion
   The following table lists the properties of a local catalog:
   ```





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260473564


##########
docs/development/extensions-contrib/iceberg.md:
##########
@@ -0,0 +1,121 @@
+---
+id: iceberg 
+title: "Iceberg"
+---
+
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+
+## Iceberg Ingest Extension
+
+This extension provides [IcebergInputSource](../../ingestion/input-sources.md#iceberg-input-source) which enables ingestion of data stored in the Iceberg table format into Druid.
+
+Apache Iceberg is an open table format for huge analytic datasets. Even though iceberg manages most of its metadata in metadata files in the object storage, it is still dependent on a metastore for managing a certain amount of metadata.
+These metastores are defined as Iceberg catalogs and this extension supports connecting to the following catalog types:
+* Hive metastore catalog
+* Local catalog
+
+Support for AWS Glue and REST-based catalogs is not available yet.
+
+For a given catalog, iceberg table name and filters, the IcebergInputSource works by reading the table from the catalog, applying the filters and extracting all the underlying live data files up to the latest snapshot.
+The data files are in either Parquet, ORC or Avro formats and all of these have InputFormat support in Druid. The data files typically reside in a warehouse location which could be in HDFS, S3 or the local filesystem.
+This extension relies on the existing InputSource connectors in Druid to read the data files from the warehouse. Therefore, the IcebergInputSource can be considered as an intermediate InputSource which provides the file paths for other InputSource implementations.
+
+### Load the Iceberg Ingest extension
+
+To use the iceberg extension, add the `druid-iceberg-extensions` to the list of loaded extensions. See [Loading extensions](../../configuration/extensions.md#loading-extensions) for more information.
+
+
+### Hive Metastore catalog
+
+For Druid to seamlessly talk to the Hive Metastore, ensure that the Hive specific configuration files such as `hive-site.xml` and `core-site.xml` are available in the Druid classpath for peon processes.  
+Hive specific properties can also be specified under the `catalogProperties` object in the ingestion spec. 
+
+Hive metastore catalogs can be associated with different types of warehouses, but this extension presently only supports HDFS and S3 warehouse directories.
+
+#### Reading from HDFS warehouse 
+
+Ensure that the extension `druid-hdfs-storage` is loaded. The data file paths are extracted from the Hive metastore catalog and [HDFS input source](../../ingestion/input-sources.md#hdfs-input-source) is used to ingest these files.
+The `warehouseSource` type in the ingestion spec should be `hdfs`.
+
+If the Hive metastore supports Kerberos authentication, the following properties will be required in the `catalogProperties`:
+
+```json
+"catalogProperties": {
+  "principal": "krb_principal",
+  "keytab": "/path/to/keytab"
+}
+```
+Only Kerberos based authentication is supported as of now.
+
+#### Reading from S3 warehouse
+
+Ensure that the extension `druid-s3-extensions` is loaded. The data file paths are extracted from the Hive metastore catalog and `S3InputSource` is used to ingest these files.
+The `warehouseSource` type in the ingestion spec should be `s3`. If the S3 endpoint for the warehouse is different from the endpoint configured as the deep storage, the following properties are required in the `warehouseSource` section to define the S3 endpoint settings:
+
+```json
+"warehouseSource": {
+  "type": "s3",
+  "endpointConfig": {
+    "url": "S3_ENDPOINT_URL",
+    "signingRegion": "us-east-1"
+  },
+  "clientConfig": {
+    "protocol": "http",
+    "disableChunkedEncoding": true,
+    "enablePathStyleAccess": true,
+    "forceGlobalBucketAccessEnabled": false
+  },
+  "properties": {
+    "accessKeyId": {
+      "type": "default",
+      "password": "<ACCESS_KEY_ID>"
+    },
+    "secretAccessKey": {
+      "type": "default",
+      "password": "<SECRET_ACCESS_KEY>"
+    }
+  }
+}
+```
+
+This extension uses the [Hadoop AWS module](https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/) to connect to S3 and retrieve the metadata and data file paths.
+The following properties will be required in the `catalogProperties`:
+
+```json
+"catalogProperties": {
+  "fs.s3a.access.key" : "S3_ACCESS_KEY",
+  "fs.s3a.secret.key" : "S3_SECRET_KEY",
+  "fs.s3a.endpoint" : "S3_API_ENDPOINT"
+}
+```
+Since the Hadoop AWS connector uses the `s3a` filesystem based client, the warehouse path should be specified with the `s3a://` protocol instead of `s3://`.
+
+### Local Catalog

Review Comment:
   ```suggestion
   ## Local catalog
   ```
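
To illustrate the `s3a://` note near the end of the hunk above: a Hive catalog pointing at an S3 warehouse would use the `s3a` scheme in `warehousePath`, roughly as in the sketch below. The bucket and path are placeholders; the remaining values reuse the placeholders from the proposed docs.

```json
"icebergCatalog": {
  "type": "hive",
  "warehousePath": "s3a://warehouse-bucket/path",
  "catalogUri": "thrift://hive-metastore.x.com:8970",
  "catalogProperties": {
    "fs.s3a.access.key" : "S3_ACCESS_KEY",
    "fs.s3a.secret.key" : "S3_SECRET_KEY",
    "fs.s3a.endpoint" : "S3_API_ENDPOINT"
  }
}
```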





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260496160


##########
docs/development/extensions-contrib/iceberg.md:
##########
@@ -0,0 +1,121 @@
+---
+id: iceberg 
+title: "Iceberg"
+---
+
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+
+## Iceberg Ingest Extension
+
+This extension provides [IcebergInputSource](../../ingestion/input-sources.md#iceberg-input-source) which enables ingestion of data stored in the Iceberg table format into Druid.
+
+Apache Iceberg is an open table format for huge analytic datasets. Even though iceberg manages most of its metadata in metadata files in the object storage, it is still dependent on a metastore for managing a certain amount of metadata.
+These metastores are defined as Iceberg catalogs and this extension supports connecting to the following catalog types:
+* Hive metastore catalog
+* Local catalog
+
+Support for AWS Glue and REST-based catalogs is not available yet.
+
+For a given catalog, iceberg table name and filters, the IcebergInputSource works by reading the table from the catalog, applying the filters and extracting all the underlying live data files up to the latest snapshot.
+The data files are in either Parquet, ORC or Avro formats and all of these have InputFormat support in Druid. The data files typically reside in a warehouse location which could be in HDFS, S3 or the local filesystem.
+This extension relies on the existing InputSource connectors in Druid to read the data files from the warehouse. Therefore, the IcebergInputSource can be considered as an intermediate InputSource which provides the file paths for other InputSource implementations.
+
+### Load the Iceberg Ingest extension
+
+To use the iceberg extension, add the `druid-iceberg-extensions` to the list of loaded extensions. See [Loading extensions](../../configuration/extensions.md#loading-extensions) for more information.
+
+
+### Hive Metastore catalog
+
+For Druid to seamlessly talk to the Hive Metastore, ensure that the Hive specific configuration files such as `hive-site.xml` and `core-site.xml` are available in the Druid classpath for peon processes.  
+Hive specific properties can also be specified under the `catalogProperties` object in the ingestion spec. 
+
+Hive metastore catalogs can be associated with different types of warehouses, but this extension presently only supports HDFS and S3 warehouse directories.
+
+#### Reading from HDFS warehouse 
+
+Ensure that the extension `druid-hdfs-storage` is loaded. The data file paths are extracted from the Hive metastore catalog and [HDFS input source](../../ingestion/input-sources.md#hdfs-input-source) is used to ingest these files.
+The `warehouseSource` type in the ingestion spec should be `hdfs`.
+
+If the Hive metastore supports Kerberos authentication, the following properties will be required in the `catalogProperties`:
+
+```json
+"catalogProperties": {
+  "principal": "krb_principal",
+  "keytab": "/path/to/keytab"
+}
+```
+Only Kerberos based authentication is supported as of now.
+
+#### Reading from S3 warehouse
+
+Ensure that the extension `druid-s3-extensions` is loaded. The data file paths are extracted from the Hive metastore catalog and `S3InputSource` is used to ingest these files.
+The `warehouseSource` type in the ingestion spec should be `s3`. If the S3 endpoint for the warehouse is different from the endpoint configured as the deep storage, the following properties are required in the `warehouseSource` section to define the S3 endpoint settings:
+
+```json
+"warehouseSource": {
+  "type": "s3",
+  "endpointConfig": {
+    "url": "S3_ENDPOINT_URL",
+    "signingRegion": "us-east-1"
+  },
+  "clientConfig": {
+    "protocol": "http",
+    "disableChunkedEncoding": true,
+    "enablePathStyleAccess": true,
+    "forceGlobalBucketAccessEnabled": false
+  },
+  "properties": {
+    "accessKeyId": {
+      "type": "default",
+      "password": "<ACCESS_KEY_ID>"
+    },
+    "secretAccessKey": {
+      "type": "default",
+      "password": "<SECRET_ACCESS_KEY>"
+    }
+  }
+}
+```
+
+This extension uses the [Hadoop AWS module](https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/) to connect to S3 and retrieve the metadata and data file paths.
+The following properties will be required in the `catalogProperties`:
+
+```json
+"catalogProperties": {
+  "fs.s3a.access.key" : "S3_ACCESS_KEY",
+  "fs.s3a.secret.key" : "S3_SECRET_KEY",
+  "fs.s3a.endpoint" : "S3_API_ENDPOINT"
+}
+```
+Since the Hadoop AWS connector uses the `s3a` filesystem based client, the warehouse path should be specified with the `s3a://` protocol instead of `s3://`.
+
+### Local Catalog
+
+The local catalog type can be used for catalogs configured on the local filesystem. The `icebergCatalog` type should be set as `local`. This catalog is useful for demos or localized tests and is not recommended for production use cases.
+This catalog only supports reading from a local filesystem and so the `warehouseSource` is defined as `local`.
+
+### Known limitations
+
+This extension does not presently fully utilize the iceberg features such as snapshotting or schema evolution. Following are the current limitations of this extension:

Review Comment:
   ```suggestion
   This section lists the known limitations that apply to the Iceberg extension.
   
   - This extension does not fully utilize the Iceberg features such as snapshotting or schema evolution. 
   ```





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260491735


##########
docs/development/extensions-contrib/iceberg.md:
##########
@@ -0,0 +1,121 @@
+---
+id: iceberg 
+title: "Iceberg"
+---
+
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+
+## Iceberg Ingest Extension
+
+This extension provides [IcebergInputSource](../../ingestion/input-sources.md#iceberg-input-source) which enables ingestion of data stored in the Iceberg table format into Druid.
+
+Apache Iceberg is an open table format for huge analytic datasets. Even though iceberg manages most of its metadata in metadata files in the object storage, it is still dependent on a metastore for managing a certain amount of metadata.
+These metastores are defined as Iceberg catalogs and this extension supports connecting to the following catalog types:
+* Hive metastore catalog
+* Local catalog
+
+Support for AWS Glue and REST-based catalogs is not available yet.
+
+For a given catalog, iceberg table name and filters, the IcebergInputSource works by reading the table from the catalog, applying the filters and extracting all the underlying live data files up to the latest snapshot.
+The data files are in either Parquet, ORC or Avro formats and all of these have InputFormat support in Druid. The data files typically reside in a warehouse location which could be in HDFS, S3 or the local filesystem.
+This extension relies on the existing InputSource connectors in Druid to read the data files from the warehouse. Therefore, the IcebergInputSource can be considered as an intermediate InputSource which provides the file paths for other InputSource implementations.
+
+### Load the Iceberg Ingest extension
+
+To use the iceberg extension, add the `druid-iceberg-extensions` to the list of loaded extensions. See [Loading extensions](../../configuration/extensions.md#loading-extensions) for more information.
+
+
+### Hive Metastore catalog
+
+For Druid to seamlessly talk to the Hive Metastore, ensure that the Hive specific configuration files such as `hive-site.xml` and `core-site.xml` are available in the Druid classpath for peon processes.  
+Hive specific properties can also be specified under the `catalogProperties` object in the ingestion spec. 
+
+Hive metastore catalogs can be associated with different types of warehouses, but this extension presently only supports HDFS and S3 warehouse directories.
+
+#### Reading from HDFS warehouse 
+
+Ensure that the extension `druid-hdfs-storage` is loaded. The data file paths are extracted from the Hive metastore catalog and [HDFS input source](../../ingestion/input-sources.md#hdfs-input-source) is used to ingest these files.
+The `warehouseSource` type in the ingestion spec should be `hdfs`.
+
+If the Hive metastore supports Kerberos authentication, the following properties will be required in the `catalogProperties`:
+
+```json
+"catalogProperties": {
+  "principal": "krb_principal",
+  "keytab": "/path/to/keytab"
+}
+```
+Only Kerberos based authentication is supported as of now.
+
+#### Reading from S3 warehouse
+
+Ensure that the extension `druid-s3-extensions` is loaded. The data file paths are extracted from the Hive metastore catalog and `S3InputSource` is used to ingest these files.
+The `warehouseSource` type in the ingestion spec should be `s3`. If the S3 endpoint for the warehouse is different from the endpoint configured as the deep storage, the following properties are required in the `warehouseSource` section to define the S3 endpoint settings:
+
+```json
+"warehouseSource": {
+  "type": "s3",
+  "endpointConfig": {
+    "url": "S3_ENDPOINT_URL",
+    "signingRegion": "us-east-1"
+  },
+  "clientConfig": {
+    "protocol": "http",
+    "disableChunkedEncoding": true,
+    "enablePathStyleAccess": true,
+    "forceGlobalBucketAccessEnabled": false
+  },
+  "properties": {
+    "accessKeyId": {
+      "type": "default",
+      "password": "<ACCESS_KEY_ID>"
+    },
+    "secretAccessKey": {
+      "type": "default",
+      "password": "<SECRET_ACCESS_KEY>"
+    }
+  }
+}
+```
+
+This extension uses the [Hadoop AWS module](https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/) to connect to S3 and retrieve the metadata and data file paths.
+The following properties will be required in the `catalogProperties`:
+
+```json
+"catalogProperties": {
+  "fs.s3a.access.key" : "S3_ACCESS_KEY",
+  "fs.s3a.secret.key" : "S3_SECRET_KEY",
+  "fs.s3a.endpoint" : "S3_API_ENDPOINT"
+}
+```
+Since the Hadoop AWS connector uses the `s3a` filesystem based client, the warehouse path should be specified with the `s3a://` protocol instead of `s3://`.
+
+### Local Catalog
+
+The local catalog type can be used for catalogs configured on the local filesystem. The `icebergCatalog` type should be set as `local`. This catalog is useful for demos or localized tests and is not recommended for production use cases.
+This catalog only supports reading from a local filesystem and so the `warehouseSource` is defined as `local`.

Review Comment:
   ```suggestion
   The `warehouseSource` is set to `local` because this catalog only supports reading from a local filesystem.
   ```
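
A minimal local-catalog spec implied by this section might look like the sketch below. The warehouse path is a placeholder, and any extra properties the local warehouse source might require (for example, a base directory) are not covered in this hunk, so treat this only as a rough outline:

```json
"inputSource": {
  "type": "iceberg",
  "tableName": "iceberg_table",
  "namespace": "iceberg_namespace",
  "icebergCatalog": {
    "type": "local",
    "warehousePath": "/tmp/iceberg_warehouse"
  },
  "warehouseSource": {
    "type": "local"
  }
}
```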





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "abhishekagarwal87 (via GitHub)" <gi...@apache.org>.
abhishekagarwal87 commented on PR #14329:
URL: https://github.com/apache/druid/pull/14329#issuecomment-1630387040

   Some dependencies are not required in the extension since they are already present in the core lib, e.g. Guava (11 MB).




Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "a2l007 (via GitHub)" <gi...@apache.org>.
a2l007 commented on PR #14329:
URL: https://github.com/apache/druid/pull/14329#issuecomment-1639051264

   @ektravel Do the doc changes look good to you?
   
   @abhishekagarwal87 I agree that the dependencies need pruning, and this is something I'm working on. A good chunk of the pruning work will be on the transitive deps for `hive-metastore` and on splitting the shaded `iceberg-spark-runtime` dependency into smaller constituents. Do you think this should be a blocker for merging this PR, since the extension is not included as part of the distribution?




Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260473325


##########
docs/development/extensions-contrib/iceberg.md:
##########
@@ -0,0 +1,121 @@
+---
+id: iceberg 
+title: "Iceberg"
+---
+
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+
+## Iceberg Ingest Extension
+
+This extension provides [IcebergInputSource](../../ingestion/input-sources.md#iceberg-input-source) which enables ingestion of data stored in the Iceberg table format into Druid.
+
+Apache Iceberg is an open table format for huge analytic datasets. Even though iceberg manages most of its metadata in metadata files in the object storage, it is still dependent on a metastore for managing a certain amount of metadata.
+These metastores are defined as Iceberg catalogs and this extension supports connecting to the following catalog types:
+* Hive metastore catalog
+* Local catalog
+
+Support for AWS Glue and REST based catalogs are not available yet.
+
+For a given catalog, iceberg table name and filters, the IcebergInputSource works by reading the table from the catalog, applying the filters and extracting all the underlying live data files up to the latest snapshot.
+The data files are in either Parquet, ORC or Avro formats and all of these have InputFormat support in Druid. The data files typically reside in a warehouse location which could be in HDFS, S3 or the local filesystem.
+This extension relies on the existing InputSource connectors in Druid to read the data files from the warehouse. Therefore, the IcebergInputSource can be considered as an intermediate InputSource which provides the file paths for other InputSource implementations.
+
+### Load the Iceberg Ingest extension
+
+To use the iceberg extension, add the `druid-iceberg-extensions` to the list of loaded extensions. See [Loading extensions](../../configuration/extensions.md#loading-extensions) for more information.
+
+
+### Hive Metastore catalog
+
+For Druid to seamlessly talk to the Hive Metastore, ensure that the Hive specific configuration files such as `hive-site.xml` and `core-site.xml` are available in the Druid classpath for peon processes.  
+Hive specific properties can also be specified under the `catalogProperties` object in the ingestion spec. 
+
+Hive metastore catalogs can be associated with different types of warehouses, but this extension presently only supports HDFS and S3 warehouse directories.
+
+#### Reading from HDFS warehouse 
+
+Ensure that the extension `druid-hdfs-storage` is loaded. The data file paths are extracted from the Hive metastore catalog and [HDFS input source](../../ingestion/input-sources.md#hdfs-input-source) is used to ingest these files.
+The `warehouseSource` type in the ingestion spec should be `hdfs`.
+
+If the Hive metastore supports Kerberos authentication, the following properties will be required in the `catalogProperties`:
+
+```json
+"catalogProperties": {
+  "principal": "krb_principal",
+  "keytab": "/path/to/keytab"
+}
+```
+Only Kerberos based authentication is supported as of now.
+
+#### Reading from S3 warehouse

Review Comment:
   ```suggestion
   ### Read from S3 warehouse
   ```





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260358588


##########
docs/development/extensions-contrib/iceberg.md:
##########
@@ -0,0 +1,121 @@
+---
+id: iceberg 
+title: "Iceberg"
+---
+
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+
+## Iceberg Ingest Extension
+
+This extension provides [IcebergInputSource](../../ingestion/input-sources.md#iceberg-input-source) which enables ingestion of data stored in the Iceberg table format into Druid.
+
+Apache Iceberg is an open table format for huge analytic datasets. Even though iceberg manages most of its metadata in metadata files in the object storage, it is still dependent on a metastore for managing a certain amount of metadata.
+These metastores are defined as Iceberg catalogs and this extension supports connecting to the following catalog types:
+* Hive metastore catalog
+* Local catalog
+
+Support for AWS Glue and REST based catalogs are not available yet.
+
+For a given catalog, iceberg table name and filters, the IcebergInputSource works by reading the table from the catalog, applying the filters and extracting all the underlying live data files up to the latest snapshot.
+The data files are in either Parquet, ORC or Avro formats and all of these have InputFormat support in Druid. The data files typically reside in a warehouse location which could be in HDFS, S3 or the local filesystem.
+This extension relies on the existing InputSource connectors in Druid to read the data files from the warehouse. Therefore, the IcebergInputSource can be considered as an intermediate InputSource which provides the file paths for other InputSource implementations.
+
+### Load the Iceberg Ingest extension
+
+To use the iceberg extension, add the `druid-iceberg-extensions` to the list of loaded extensions. See [Loading extensions](../../configuration/extensions.md#loading-extensions) for more information.
+
+
+### Hive Metastore catalog
+
+For Druid to seamlessly talk to the Hive Metastore, ensure that the Hive specific configuration files such as `hive-site.xml` and `core-site.xml` are available in the Druid classpath for peon processes.  
+Hive specific properties can also be specified under the `catalogProperties` object in the ingestion spec. 
+
+Hive metastore catalogs can be associated with different types of warehouses, but this extension presently only supports HDFS and S3 warehouse directories.
+
+#### Reading from HDFS warehouse 
+
+Ensure that the extension `druid-hdfs-storage` is loaded. The data file paths are extracted from the Hive metastore catalog and [HDFS input source](../../ingestion/input-sources.md#hdfs-input-source) is used to ingest these files.
+The `warehouseSource` type in the ingestion spec should be `hdfs`.
+
+If the Hive metastore supports Kerberos authentication, the following properties will be required in the `catalogProperties`:

Review Comment:
   ```suggestion
   If the Hive metastore supports Kerberos authentication, include `principal` and `keytab` properties in the `catalogProperties` object:
   ```





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260355964


##########
docs/development/extensions-contrib/iceberg.md:
##########
@@ -0,0 +1,121 @@
+---
+id: iceberg 
+title: "Iceberg"
+---
+
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+
+## Iceberg Ingest Extension
+
+This extension provides [IcebergInputSource](../../ingestion/input-sources.md#iceberg-input-source) which enables ingestion of data stored in the Iceberg table format into Druid.
+
+Apache Iceberg is an open table format for huge analytic datasets. Even though iceberg manages most of its metadata in metadata files in the object storage, it is still dependent on a metastore for managing a certain amount of metadata.
+These metastores are defined as Iceberg catalogs and this extension supports connecting to the following catalog types:
+* Hive metastore catalog
+* Local catalog
+
+Support for AWS Glue and REST based catalogs are not available yet.
+
+For a given catalog, iceberg table name and filters, the IcebergInputSource works by reading the table from the catalog, applying the filters and extracting all the underlying live data files up to the latest snapshot.
+The data files are in either Parquet, ORC or Avro formats and all of these have InputFormat support in Druid. The data files typically reside in a warehouse location which could be in HDFS, S3 or the local filesystem.
+This extension relies on the existing InputSource connectors in Druid to read the data files from the warehouse. Therefore, the IcebergInputSource can be considered as an intermediate InputSource which provides the file paths for other InputSource implementations.
+
+### Load the Iceberg Ingest extension
+
+To use the iceberg extension, add the `druid-iceberg-extensions` to the list of loaded extensions. See [Loading extensions](../../configuration/extensions.md#loading-extensions) for more information.
+
+
+### Hive Metastore catalog
+
+For Druid to seamlessly talk to the Hive Metastore, ensure that the Hive specific configuration files such as `hive-site.xml` and `core-site.xml` are available in the Druid classpath for peon processes.  

Review Comment:
   ```suggestion
   For Druid to seamlessly talk to the Hive metastore, ensure that the Hive configuration files such as `hive-site.xml` and `core-site.xml` are available in the Druid classpath for peon processes.  
   ```





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260497609


##########
docs/development/extensions-contrib/iceberg.md:
##########
@@ -0,0 +1,121 @@
+---
+id: iceberg 
+title: "Iceberg"
+---
+
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+
+## Iceberg Ingest Extension
+
+This extension provides [IcebergInputSource](../../ingestion/input-sources.md#iceberg-input-source) which enables ingestion of data stored in the Iceberg table format into Druid.
+
+Apache Iceberg is an open table format for huge analytic datasets. Even though iceberg manages most of its metadata in metadata files in the object storage, it is still dependent on a metastore for managing a certain amount of metadata.
+These metastores are defined as Iceberg catalogs and this extension supports connecting to the following catalog types:
+* Hive metastore catalog
+* Local catalog
+
+Support for AWS Glue and REST based catalogs are not available yet.
+
+For a given catalog, iceberg table name and filters, the IcebergInputSource works by reading the table from the catalog, applying the filters and extracting all the underlying live data files up to the latest snapshot.
+The data files are in either Parquet, ORC or Avro formats and all of these have InputFormat support in Druid. The data files typically reside in a warehouse location which could be in HDFS, S3 or the local filesystem.
+This extension relies on the existing InputSource connectors in Druid to read the data files from the warehouse. Therefore, the IcebergInputSource can be considered as an intermediate InputSource which provides the file paths for other InputSource implementations.
+
+### Load the Iceberg Ingest extension
+
+To use the iceberg extension, add the `druid-iceberg-extensions` to the list of loaded extensions. See [Loading extensions](../../configuration/extensions.md#loading-extensions) for more information.
+
+
+### Hive Metastore catalog
+
+For Druid to seamlessly talk to the Hive Metastore, ensure that the Hive specific configuration files such as `hive-site.xml` and `core-site.xml` are available in the Druid classpath for peon processes.  
+Hive specific properties can also be specified under the `catalogProperties` object in the ingestion spec. 
+
+Hive metastore catalogs can be associated with different types of warehouses, but this extension presently only supports HDFS and S3 warehouse directories.
+
+#### Reading from HDFS warehouse 
+
+Ensure that the extension `druid-hdfs-storage` is loaded. The data file paths are extracted from the Hive metastore catalog and [HDFS input source](../../ingestion/input-sources.md#hdfs-input-source) is used to ingest these files.
+The `warehouseSource` type in the ingestion spec should be `hdfs`.
+
+If the Hive metastore supports Kerberos authentication, the following properties will be required in the `catalogProperties`:
+
+```json
+"catalogProperties": {
+  "principal": "krb_principal",
+  "keytab": "/path/to/keytab"
+}
+```
+Only Kerberos based authentication is supported as of now.
+
+#### Reading from S3 warehouse
+
+Ensure that the extension `druid-s3-extensions` is loaded. The data file paths are extracted from the Hive metastore catalog and `S3InputSource` is used to ingest these files.
+The `warehouseSource` type in the ingestion spec should be `s3`. If the S3 endpoint for the warehouse is different from the endpoint configured as the deep storage, the following properties are required in the `warehouseSource` section to define the S3 endpoint settings:
+
+```json
+"warehouseSource": {
+  "type": "s3",
+  "endpointConfig": {
+    "url": "S3_ENDPOINT_URL",
+    "signingRegion": "us-east-1"
+  },
+  "clientConfig": {
+    "protocol": "http",
+    "disableChunkedEncoding": true,
+    "enablePathStyleAccess": true,
+    "forceGlobalBucketAccessEnabled": false
+  },
+  "properties": {
+    "accessKeyId": {
+      "type": "default",
+      "password": "<ACCESS_KEY_ID>"
+    },
+    "secretAccessKey": {
+      "type": "default",
+      "password": "<SECRET_ACCESS_KEY>"
+    }
+  }
+}
+```
+
+This extension uses the [Hadoop AWS module](https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/) to connect to S3 and retrieve the metadata and data file paths.
+The following properties will be required in the `catalogProperties`:
+
+```json
+"catalogProperties": {
+  "fs.s3a.access.key" : "S3_ACCESS_KEY",
+  "fs.s3a.secret.key" : "S3_SECRET_KEY",
+  "fs.s3a.endpoint" : "S3_API_ENDPOINT"
+}
+```
+Since the Hadoop AWS connector uses the `s3a` filesystem based client, the warehouse path should be specified with the `s3a://` protocol instead of `s3://`.
+
+### Local Catalog
+
+The local catalog type can be used for catalogs configured on the local filesystem. The `icebergCatalog` type should be set as `local`. This catalog is useful for demos or localized tests and is not recommended for production use cases.
+This catalog only supports reading from a local filesystem and so the `warehouseSource` is defined as `local`.
+
+### Known limitations
+
+This extension does not presently fully utilize the iceberg features such as snapshotting or schema evolution. Following are the current limitations of this extension:
+
+- The `IcebergInputSource` reads every single live file on the iceberg table up to the latest snapshot, which makes the table scan less performant. It is recommended to use iceberg filters on partition columns in the ingestion spec in order to limit the number of data files being retrieved. Since, Druid doesn't store the last ingested iceberg snapshot ID, it cannot identify the files created between that snapshot and the latest snapshot on iceberg.

Review Comment:
   ```suggestion
   - The Iceberg input source reads every single live file on the Iceberg table up to the latest snapshot, which makes the table scan less performant. It is recommended to use Iceberg filters on partition columns in the ingestion spec to limit the number of data files being retrieved. Since Druid doesn't store the last ingested iceberg snapshot ID, it cannot identify the files created between that snapshot and the latest snapshot on Iceberg.
   ```
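   To make the partition-column recommendation concrete, an `icebergFilter` limited to a hypothetical partition column could use the `interval` filter documented in `input-sources.md`, for example:

   ```json
   "icebergFilter": {
     "type": "interval",
     "filterColumn": "dt_partition",
     "intervals": [
       "2023-05-10T00:00:00.000Z/2023-05-11T00:00:00.000Z"
     ]
   }
   ```

   Here `dt_partition` stands in for whichever column the Iceberg table is actually partitioned on.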





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260538864


##########
docs/ingestion/input-sources.md:
##########
@@ -794,6 +794,194 @@ The following is an example of a Combining input source spec:
 ...
 ```
 
+## Iceberg input source
+
+> You need to include the `druid-iceberg-extensions` as an extension to use the Iceberg input source.
+
+The Iceberg input source is used to read data stored in the Iceberg table format. For a given table, this input source scans up to the latest iceberg snapshot from the configured Hive catalog and the underlying live data files will be ingested using the existing input source formats available in Druid.
+
+The Iceberg input source cannot be independent as it relies on the existing input sources to perform the actual read from the Data files.
+For example, if the warehouse associated with an iceberg catalog is on `S3`, please ensure that the [`druid-s3-extensions`](../development/extensions-core/s3.md) extension is also loaded.
+
+Sample specs:
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "iceberg",
+        "tableName": "iceberg_table",
+        "namespace": "iceberg_namespace",
+        "icebergCatalog": {
+            "type": "hive",
+            "warehousePath": "hdfs://warehouse/path",
+            "catalogUri": "thrift://hive-metastore.x.com:8970",
+            "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "catalog_test",
+                "hadoop.security.authentication": "kerberos",
+                "hadoop.security.authorization": "true"
+            }
+        },
+        "icebergFilter": {
+            "type": "interval",
+            "filterColumn": "event_time",
+            "intervals": [
+              "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+            ]
+        },
+        "warehouseSource": {
+            "type": "hdfs"
+        }
+      },
+      "inputFormat": {
+        "type": "parquet"
+      }
+  },
+      ...
+},
+...
+```
+
+```json
+...
+        "ioConfig": {
+          "type": "index_parallel",
+          "inputSource": {
+            "type": "iceberg",
+            "tableName": "iceberg_table",
+            "namespace": "iceberg_namespace",
+            "icebergCatalog": {
+              "type": "hive",
+              "warehousePath": "hdfs://warehouse/path",
+              "catalogUri": "thrift://hive-metastore.x.com:8970",
+              "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "default_catalog",
+                "fs.s3a.access.key" : "S3_ACCESS_KEY",
+                "fs.s3a.secret.key" : "S3_SECRET_KEY",
+                "fs.s3a.endpoint" : "S3_API_ENDPOINT"
+              }
+            },
+            "icebergFilter": {
+              "type": "interval",
+              "filterColumn": "event_time",
+              "intervals": [
+                "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+              ]
+            },
+            "warehouseSource": {
+              "type": "s3",
+              "endpointConfig": {
+                "url": "teststore.aws.com",
+                "signingRegion": "us-west-2a"
+              },
+              "clientConfig": {
+                "protocol": "http",
+                "disableChunkedEncoding": true,
+                "enablePathStyleAccess": true,
+                "forceGlobalBucketAccessEnabled": false
+              },
+              "properties": {
+                "accessKeyId": {
+                  "type": "default",
+                  "password": "foo"
+                },
+                "secretAccessKey": {
+                  "type": "default",
+                  "password": "bar"
+                }
+              }
+            }
+          },
+          "inputFormat": {
+            "type": "parquet"
+          }
+        },
+...
+},
+```
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set the value to `iceberg`.|yes|
+|tableName|The iceberg table name configured in the catalog.|yes|
+|namespace|The iceberg namespace associated with the table|yes|
+|icebergFilter|JSON Object used to filter data files within a snapshot when reading|no|
+|icebergCatalog|JSON Object used to define the catalog that manages the configured iceberg table|yes|
+|warehouseSource|JSON Object used to indicate which native input source needs to be used to read the data files from the warehouse|yes|
+
+Catalog Object:
+
+There are two supported catalog types: `local` and `hive`
+
+Local catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `local`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+Hive Catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `hive`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogUri|The URI associated with the hive catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+IcebergFilter Object:
+
+This input source provides filters: `and` , `equals` , `interval` and `or`. These filters can be used to filter out data files from a snapshot, thereby reducing the number of files Druid has to ingest.
+
+`equals` Filter:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `equals`.|yes|
+|filterColumn|The column name from the iceberg table schema based on which filtering needs to happen|yes|
+|filterValue|The value to filter on|yes|
+
+`interval` Filter:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `interval`.|yes|
+|filterColumn|The column name from the iceberg table schema based on which filtering needs to happen|yes|
+|intervals|A JSON array containing ISO-8601 interval strings. This defines the time ranges to filter on. The start interval is inclusive and the end interval is exclusive. |yes|
+
+`and` Filter:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `and`.|yes|
+|filters|List of iceberg filters that needs to be AND-ed|yes|
+
+`or` Filter:
+
+|Property|Description|Required|

Review Comment:
   ```suggestion
   |**Property**|**Description**|**Required**|
   ```
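   As a sketch of how the composite filters combine the simple ones, an `and` filter wrapping the documented `interval` and `equals` filters might look like this; the column names and values are placeholders:

   ```json
   "icebergFilter": {
     "type": "and",
     "filters": [
       {
         "type": "interval",
         "filterColumn": "event_time",
         "intervals": ["2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"]
       },
       {
         "type": "equals",
         "filterColumn": "region",
         "filterValue": "us-east-1"
       }
     ]
   }
   ```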





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260530944


##########
docs/ingestion/input-sources.md:
##########
@@ -794,6 +794,194 @@ The following is an example of a Combining input source spec:
 ...
 ```
 
+## Iceberg input source
+
+> You need to include the `druid-iceberg-extensions` as an extension to use the Iceberg input source.
+
+The Iceberg input source is used to read data stored in the Iceberg table format. For a given table, this input source scans up to the latest iceberg snapshot from the configured Hive catalog and the underlying live data files will be ingested using the existing input source formats available in Druid.
+
+The Iceberg input source cannot be independent as it relies on the existing input sources to perform the actual read from the Data files.
+For example, if the warehouse associated with an iceberg catalog is on `S3`, please ensure that the [`druid-s3-extensions`](../development/extensions-core/s3.md) extension is also loaded.
+
+Sample specs:
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "iceberg",
+        "tableName": "iceberg_table",
+        "namespace": "iceberg_namespace",
+        "icebergCatalog": {
+            "type": "hive",
+            "warehousePath": "hdfs://warehouse/path",
+            "catalogUri": "thrift://hive-metastore.x.com:8970",
+            "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "catalog_test",
+                "hadoop.security.authentication": "kerberos",
+                "hadoop.security.authorization": "true"
+            }
+        },
+        "icebergFilter": {
+            "type": "interval",
+            "filterColumn": "event_time",
+            "intervals": [
+              "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+            ]
+        },
+        "warehouseSource": {
+            "type": "hdfs"
+        }
+      },
+      "inputFormat": {
+        "type": "parquet"
+      }
+  },
+      ...
+},
+...
+```
+
+```json
+...
+        "ioConfig": {
+          "type": "index_parallel",
+          "inputSource": {
+            "type": "iceberg",
+            "tableName": "iceberg_table",
+            "namespace": "iceberg_namespace",
+            "icebergCatalog": {
+              "type": "hive",
+              "warehousePath": "hdfs://warehouse/path",
+              "catalogUri": "thrift://hive-metastore.x.com:8970",
+              "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "default_catalog",
+                "fs.s3a.access.key" : "S3_ACCESS_KEY",
+                "fs.s3a.secret.key" : "S3_SECRET_KEY",
+                "fs.s3a.endpoint" : "S3_API_ENDPOINT"
+              }
+            },
+            "icebergFilter": {
+              "type": "interval",
+              "filterColumn": "event_time",
+              "intervals": [
+                "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+              ]
+            },
+            "warehouseSource": {
+              "type": "s3",
+              "endpointConfig": {
+                "url": "teststore.aws.com",
+                "signingRegion": "us-west-2a"
+              },
+              "clientConfig": {
+                "protocol": "http",
+                "disableChunkedEncoding": true,
+                "enablePathStyleAccess": true,
+                "forceGlobalBucketAccessEnabled": false
+              },
+              "properties": {
+                "accessKeyId": {
+                  "type": "default",
+                  "password": "foo"
+                },
+                "secretAccessKey": {
+                  "type": "default",
+                  "password": "bar"
+                }
+              }
+            }
+          },
+          "inputFormat": {
+            "type": "parquet"
+          }
+        },
+...
+},
+```
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set the value to `iceberg`.|yes|
+|tableName|The iceberg table name configured in the catalog.|yes|
+|namespace|The iceberg namespace associated with the table|yes|
+|icebergFilter|JSON Object used to filter data files within a snapshot when reading|no|
+|icebergCatalog|JSON Object used to define the catalog that manages the configured iceberg table|yes|
+|warehouseSource|JSON Object used to indicate which native input source needs to be used to read the data files from the warehouse|yes|
+
+Catalog Object:
+
+There are two supported catalog types: `local` and `hive`
+
+Local catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `local`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+Hive Catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `hive`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogUri|The URI associated with the hive catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+IcebergFilter Object:
+
+This input source provides filters: `and` , `equals` , `interval` and `or`. These filters can be used to filter out data files from a snapshot, thereby reducing the number of files Druid has to ingest.
+
+`equals` Filter:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `equals`.|yes|
+|filterColumn|The column name from the iceberg table schema based on which filtering needs to happen|yes|

Review Comment:
   ```suggestion
   |`filterColumn`|The name of the column from the Iceberg table schema to use for filtering.|Yes|
   ```
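   For instance, an `equals` filter built from these properties might look like the following sketch; the column name and value are placeholders:

   ```json
   "icebergFilter": {
     "type": "equals",
     "filterColumn": "region",
     "filterValue": "us-east-1"
   }
   ```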





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "abhishekagarwal87 (via GitHub)" <gi...@apache.org>.
abhishekagarwal87 commented on PR #14329:
URL: https://github.com/apache/druid/pull/14329#issuecomment-1639426663

   @a2l007 - do you want to backport the core changes to 27 so folks can try out the extension with the 27 release? 




Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on PR #14329:
URL: https://github.com/apache/druid/pull/14329#issuecomment-1640392649

   @a2l007 Thank you for making the requested changes. They look good to me.




[GitHub] [druid] github-code-scanning[bot] commented on a diff in pull request #14329: Extension to read and ingest iceberg data files

Posted by "github-code-scanning[bot] (via GitHub)" <gi...@apache.org>.
github-code-scanning[bot] commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1201265868


##########
extensions-core/s3-extensions/src/test/java/org/apache/druid/data/input/s3/S3InputSourceAdapterTest.java:
##########
@@ -0,0 +1,64 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.data.input.s3;
+
+import com.amazonaws.ClientConfiguration;
+import com.amazonaws.services.s3.AmazonS3Client;
+import com.amazonaws.services.s3.AmazonS3ClientBuilder;
+import org.apache.druid.storage.s3.S3InputDataConfig;
+import org.apache.druid.storage.s3.ServerSideEncryptingAmazonS3;
+import org.easymock.EasyMock;
+import org.junit.Assert;
+import org.junit.Test;
+
+import java.util.Arrays;
+import java.util.List;
+
+public class S3InputSourceAdapterTest
+{
+  @Test
+  public void testAdapterGet()
+  {
+    AmazonS3Client s3Client = EasyMock.createMock(AmazonS3Client.class);

Review Comment:
   ## Unread local variable
   
   Variable 'AmazonS3Client s3Client' is never read.
   
   [Show more details](https://github.com/apache/druid/security/code-scanning/4975)



##########
extensions-core/s3-extensions/src/test/java/org/apache/druid/data/input/s3/S3InputSourceAdapterTest.java:
##########
@@ -0,0 +1,64 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.data.input.s3;
+
+import com.amazonaws.ClientConfiguration;
+import com.amazonaws.services.s3.AmazonS3Client;
+import com.amazonaws.services.s3.AmazonS3ClientBuilder;
+import org.apache.druid.storage.s3.S3InputDataConfig;
+import org.apache.druid.storage.s3.ServerSideEncryptingAmazonS3;
+import org.easymock.EasyMock;
+import org.junit.Assert;
+import org.junit.Test;
+
+import java.util.Arrays;
+import java.util.List;
+
+public class S3InputSourceAdapterTest
+{
+  @Test
+  public void testAdapterGet()
+  {
+    AmazonS3Client s3Client = EasyMock.createMock(AmazonS3Client.class);
+    ClientConfiguration clientConfiguration = EasyMock.createMock(ClientConfiguration.class);
+    ServerSideEncryptingAmazonS3.Builder serverSides3Builder =
+        EasyMock.createMock(ServerSideEncryptingAmazonS3.Builder.class);
+    AmazonS3ClientBuilder s3ClientBuilder = AmazonS3Client.builder();

Review Comment:
   ## Unread local variable
   
   Variable 'AmazonS3ClientBuilder s3ClientBuilder' is never read.
   
   [Show more details](https://github.com/apache/druid/security/code-scanning/4977)



##########
extensions-contrib/druid-iceberg-extensions/src/main/java/org/apache/druid/iceberg/input/HiveIcebergCatalog.java:
##########
@@ -0,0 +1,136 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.iceberg.input;
+
+import com.fasterxml.jackson.annotation.JacksonInject;
+import com.fasterxml.jackson.annotation.JsonCreator;
+import com.fasterxml.jackson.annotation.JsonProperty;
+import com.google.common.base.Preconditions;
+import com.google.common.base.Strings;
+import org.apache.druid.iceberg.guice.HiveConf;
+import org.apache.druid.java.util.common.ISE;
+import org.apache.druid.java.util.common.logger.Logger;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.security.UserGroupInformation;
+import org.apache.iceberg.BaseMetastoreCatalog;
+import org.apache.iceberg.hive.HiveCatalog;
+
+import javax.annotation.Nullable;
+import java.io.IOException;
+import java.util.HashMap;
+import java.util.Map;
+
+/**
+ * Hive Metastore specific implementation of iceberg catalog.
+ * Kerberos authentication is performed if the credentials are provided in the catalog properties
+ */
+public class HiveIcebergCatalog extends IcebergCatalog
+{
+  public static final String TYPE_KEY = "hive";
+
+  @JsonProperty
+  private String warehousePath;
+
+  @JsonProperty
+  private String catalogUri;
+
+  @JsonProperty
+  private Map<String, String> catalogProperties;
+
+  private final Configuration configuration;
+
+  private BaseMetastoreCatalog hiveCatalog;
+
+  private static final Logger log = new Logger(HiveIcebergCatalog.class);
+
+  @JsonCreator
+  public HiveIcebergCatalog(
+      @JsonProperty("warehousePath") String warehousePath,
+      @JsonProperty("catalogUri") String catalogUri,
+      @JsonProperty("catalogProperties") @Nullable
+          Map<String, String> catalogProperties,
+      @JacksonInject @HiveConf Configuration configuration
+  )
+  {
+    this.warehousePath = Preconditions.checkNotNull(warehousePath, "warehousePath cannot be null");
+    this.catalogUri = Preconditions.checkNotNull(catalogUri, "catalogUri cannot be null");
+    this.catalogProperties = catalogProperties != null ? catalogProperties : new HashMap<>();
+    this.configuration = configuration;
+    catalogProperties

Review Comment:
   ## Dereferenced variable may be null
   
   Variable [catalogProperties](1) may be null at this access as suggested by [this](2) null guard.
   
   [Show more details](https://github.com/apache/druid/security/code-scanning/4978)
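   One way the flagged dereference could be avoided, assuming the intent of the elided line is to copy the catalog properties into the Hadoop `Configuration`, is to touch only the null-guarded field after the assignment. This is a sketch, not necessarily the fix that was committed:

   ```java
   this.catalogProperties = catalogProperties != null ? catalogProperties : new HashMap<>();
   this.configuration = configuration;
   // Dereference the guarded field rather than the possibly-null constructor argument.
   this.catalogProperties.forEach(this.configuration::set);
   ```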



##########
extensions-core/s3-extensions/src/test/java/org/apache/druid/data/input/s3/S3InputSourceAdapterTest.java:
##########
@@ -0,0 +1,64 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.data.input.s3;
+
+import com.amazonaws.ClientConfiguration;
+import com.amazonaws.services.s3.AmazonS3Client;
+import com.amazonaws.services.s3.AmazonS3ClientBuilder;
+import org.apache.druid.storage.s3.S3InputDataConfig;
+import org.apache.druid.storage.s3.ServerSideEncryptingAmazonS3;
+import org.easymock.EasyMock;
+import org.junit.Assert;
+import org.junit.Test;
+
+import java.util.Arrays;
+import java.util.List;
+
+public class S3InputSourceAdapterTest
+{
+  @Test
+  public void testAdapterGet()
+  {
+    AmazonS3Client s3Client = EasyMock.createMock(AmazonS3Client.class);
+    ClientConfiguration clientConfiguration = EasyMock.createMock(ClientConfiguration.class);

Review Comment:
   ## Unread local variable
   
   Variable 'ClientConfiguration clientConfiguration' is never read.
   
   [Show more details](https://github.com/apache/druid/security/code-scanning/4976)





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "abhishekagarwal87 (via GitHub)" <gi...@apache.org>.
abhishekagarwal87 commented on PR #14329:
URL: https://github.com/apache/druid/pull/14329#issuecomment-1630361260

   @a2l007 - Looks good to me. Thank you. We have a release branch already cut. I was thinking that maybe you can backport just the core changes. That way, anyone can build the extension and try it on a production release. 




Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260313082


##########
docs/development/extensions-contrib/iceberg.md:
##########
@@ -0,0 +1,121 @@
+---
+id: iceberg 
+title: "Iceberg"
+---
+
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+
+## Iceberg Ingest Extension

Review Comment:
   ```suggestion
   ## Iceberg Ingest extension
   ```





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260482809


##########
docs/development/extensions-contrib/iceberg.md:
##########
@@ -0,0 +1,121 @@
+---
+id: iceberg 
+title: "Iceberg"
+---
+
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+
+## Iceberg Ingest Extension
+
+This extension provides [IcebergInputSource](../../ingestion/input-sources.md#iceberg-input-source) which enables ingestion of data stored in the Iceberg table format into Druid.
+
+Apache Iceberg is an open table format for huge analytic datasets. Even though iceberg manages most of its metadata in metadata files in the object storage, it is still dependent on a metastore for managing a certain amount of metadata.
+These metastores are defined as Iceberg catalogs and this extension supports connecting to the following catalog types:
+* Hive metastore catalog
+* Local catalog
+
+Support for AWS Glue and REST based catalogs are not available yet.
+
+For a given catalog, iceberg table name and filters, the IcebergInputSource works by reading the table from the catalog, applying the filters and extracting all the underlying live data files up to the latest snapshot.
+The data files are in either Parquet, ORC or Avro formats and all of these have InputFormat support in Druid. The data files typically reside in a warehouse location which could be in HDFS, S3 or the local filesystem.
+This extension relies on the existing InputSource connectors in Druid to read the data files from the warehouse. Therefore, the IcebergInputSource can be considered as an intermediate InputSource which provides the file paths for other InputSource implementations.
+
+### Load the Iceberg Ingest extension
+
+To use the iceberg extension, add the `druid-iceberg-extensions` to the list of loaded extensions. See [Loading extensions](../../configuration/extensions.md#loading-extensions) for more information.
+
+
+### Hive Metastore catalog
+
+For Druid to seamlessly talk to the Hive Metastore, ensure that the Hive specific configuration files such as `hive-site.xml` and `core-site.xml` are available in the Druid classpath for peon processes.  
+Hive specific properties can also be specified under the `catalogProperties` object in the ingestion spec. 
+
+Hive metastore catalogs can be associated with different types of warehouses, but this extension presently only supports HDFS and S3 warehouse directories.
+
+#### Reading from HDFS warehouse 
+
+Ensure that the extension `druid-hdfs-storage` is loaded. The data file paths are extracted from the Hive metastore catalog and [HDFS input source](../../ingestion/input-sources.md#hdfs-input-source) is used to ingest these files.
+The `warehouseSource` type in the ingestion spec should be `hdfs`.
+
+If the Hive metastore supports Kerberos authentication, the following properties will be required in the `catalogProperties`:
+
+```json
+"catalogProperties": {
+  "principal": "krb_principal",
+  "keytab": "/path/to/keytab"
+}
+```
+Only Kerberos based authentication is supported as of now.
+
+#### Reading from S3 warehouse
+
+Ensure that the extension `druid-s3-extensions` is loaded. The data file paths are extracted from the Hive metastore catalog and `S3InputSource` is used to ingest these files.

Review Comment:
   ```suggestion
   To read from a S3 warehouse, load the `druid-s3-extensions` extension. Druid extracts the data file paths from the Hive metastore catalog and uses `S3InputSource` to ingest these files.
   ```





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260503001


##########
docs/ingestion/input-sources.md:
##########
@@ -794,6 +794,194 @@ The following is an example of a Combining input source spec:
 ...
 ```
 
+## Iceberg input source
+
+> You need to include the `druid-iceberg-extensions` as an extension to use the Iceberg input source.
+
+The Iceberg input source is used to read data stored in the Iceberg table format. For a given table, this input source scans up to the latest iceberg snapshot from the configured Hive catalog and the underlying live data files will be ingested using the existing input source formats available in Druid.
+
+The Iceberg input source cannot be independent as it relies on the existing input sources to perform the actual read from the Data files.
+For example, if the warehouse associated with an iceberg catalog is on `S3`, please ensure that the [`druid-s3-extensions`](../development/extensions-core/s3.md) extension is also loaded.
+
+Sample specs:
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "iceberg",
+        "tableName": "iceberg_table",
+        "namespace": "iceberg_namespace",
+        "icebergCatalog": {
+            "type": "hive",
+            "warehousePath": "hdfs://warehouse/path",
+            "catalogUri": "thrift://hive-metastore.x.com:8970",
+            "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "catalog_test",
+                "hadoop.security.authentication": "kerberos",
+                "hadoop.security.authorization": "true"
+            }
+        },
+        "icebergFilter": {
+            "type": "interval",
+            "filterColumn": "event_time",
+            "intervals": [
+              "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+            ]
+        },
+        "warehouseSource": {
+            "type": "hdfs"
+        }
+      },
+      "inputFormat": {
+        "type": "parquet"
+      }
+  },
+      ...
+},
+...
+```
+
+```json
+...
+        "ioConfig": {
+          "type": "index_parallel",
+          "inputSource": {
+            "type": "iceberg",
+            "tableName": "iceberg_table",
+            "namespace": "iceberg_namespace",
+            "icebergCatalog": {
+              "type": "hive",
+              "warehousePath": "hdfs://warehouse/path",
+              "catalogUri": "thrift://hive-metastore.x.com:8970",
+              "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "default_catalog",
+                "fs.s3a.access.key" : "S3_ACCESS_KEY",
+                "fs.s3a.secret.key" : "S3_SECRET_KEY",
+                "fs.s3a.endpoint" : "S3_API_ENDPOINT"
+              }
+            },
+            "icebergFilter": {
+              "type": "interval",
+              "filterColumn": "event_time",
+              "intervals": [
+                "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+              ]
+            },
+            "warehouseSource": {
+              "type": "s3",
+              "endpointConfig": {
+                "url": "teststore.aws.com",
+                "signingRegion": "us-west-2a"
+              },
+              "clientConfig": {
+                "protocol": "http",
+                "disableChunkedEncoding": true,
+                "enablePathStyleAccess": true,
+                "forceGlobalBucketAccessEnabled": false
+              },
+              "properties": {
+                "accessKeyId": {
+                  "type": "default",
+                  "password": "foo"
+                },
+                "secretAccessKey": {
+                  "type": "default",
+                  "password": "bar"
+                }
+              }
+            }
+          },
+          "inputFormat": {
+            "type": "parquet"
+          }
+        },
+...
+},
+```
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set the value to `iceberg`.|yes|

Review Comment:
   ```suggestion
   |`type`|Set the value to `iceberg`.|Yes|
   ```
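For a quick orientation, here is a pared-down version of the first sample spec above, reduced to just the `inputSource` object. Per the property tables elsewhere in this PR, `icebergFilter` and `catalogProperties` are optional, so they are left out of this sketch.

```json
"inputSource": {
  "type": "iceberg",
  "tableName": "iceberg_table",
  "namespace": "iceberg_namespace",
  "icebergCatalog": {
    "type": "hive",
    "warehousePath": "hdfs://warehouse/path",
    "catalogUri": "thrift://hive-metastore.x.com:8970"
  },
  "warehouseSource": {
    "type": "hdfs"
  }
}
```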





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260538383


##########
docs/ingestion/input-sources.md:
##########
@@ -794,6 +794,194 @@ The following is an example of a Combining input source spec:
 ...
 ```
 
+## Iceberg input source
+
+> To use the Iceberg input source, load the extension `druid-iceberg-extensions`.
+
+The Iceberg input source reads data stored in the Iceberg table format. For a given table, this input source scans up to the latest Iceberg snapshot from the configured catalog and ingests the underlying live data files using the existing input source formats available in Druid.
+
+The Iceberg input source cannot be used on its own; it relies on the existing input sources to perform the actual reads from the data files.
+For example, if the warehouse associated with an Iceberg catalog is on S3, make sure the [`druid-s3-extensions`](../development/extensions-core/s3.md) extension is also loaded.
+
+Sample specs:
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "iceberg",
+        "tableName": "iceberg_table",
+        "namespace": "iceberg_namespace",
+        "icebergCatalog": {
+            "type": "hive",
+            "warehousePath": "hdfs://warehouse/path",
+            "catalogUri": "thrift://hive-metastore.x.com:8970",
+            "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "catalog_test",
+                "hadoop.security.authentication": "kerberos",
+                "hadoop.security.authorization": "true"
+            }
+        },
+        "icebergFilter": {
+            "type": "interval",
+            "filterColumn": "event_time",
+            "intervals": [
+              "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+            ]
+        },
+        "warehouseSource": {
+            "type": "hdfs"
+        }
+      },
+      "inputFormat": {
+        "type": "parquet"
+      }
+  },
+      ...
+},
+...
+```
+
+```json
+...
+        "ioConfig": {
+          "type": "index_parallel",
+          "inputSource": {
+            "type": "iceberg",
+            "tableName": "iceberg_table",
+            "namespace": "iceberg_namespace",
+            "icebergCatalog": {
+              "type": "hive",
+              "warehousePath": "hdfs://warehouse/path",
+              "catalogUri": "thrift://hive-metastore.x.com:8970",
+              "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "default_catalog",
+                "fs.s3a.access.key" : "S3_ACCESS_KEY",
+                "fs.s3a.secret.key" : "S3_SECRET_KEY",
+                "fs.s3a.endpoint" : "S3_API_ENDPOINT"
+              }
+            },
+            "icebergFilter": {
+              "type": "interval",
+              "filterColumn": "event_time",
+              "intervals": [
+                "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+              ]
+            },
+            "warehouseSource": {
+              "type": "s3",
+              "endpointConfig": {
+                "url": "teststore.aws.com",
+                "signingRegion": "us-west-2a"
+              },
+              "clientConfig": {
+                "protocol": "http",
+                "disableChunkedEncoding": true,
+                "enablePathStyleAccess": true,
+                "forceGlobalBucketAccessEnabled": false
+              },
+              "properties": {
+                "accessKeyId": {
+                  "type": "default",
+                  "password": "foo"
+                },
+                "secretAccessKey": {
+                  "type": "default",
+                  "password": "bar"
+                }
+              }
+            }
+          },
+          "inputFormat": {
+            "type": "parquet"
+          }
+        },
+...
+},
+```
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set the value to `iceberg`.|yes|
+|tableName|The iceberg table name configured in the catalog.|yes|
+|namespace|The iceberg namespace associated with the table|yes|
+|icebergFilter|JSON Object used to filter data files within a snapshot when reading|no|
+|icebergCatalog|JSON Object used to define the catalog that manages the configured iceberg table|yes|
+|warehouseSource|JSON Object used to indicate which native input source needs to be used to read the data files from the warehouse|yes|
+
+Catalog Object:
+
+There are two supported catalog types: `local` and `hive`
+
+Local catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `local`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+Hive Catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `hive`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogUri|The URI associated with the hive catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+IcebergFilter Object:
+
+This input source provides the following filters: `and`, `equals`, `interval`, and `or`. Use these filters to prune data files from a snapshot, reducing the number of files Druid has to ingest.
+
+`equals` Filter:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `equals`.|yes|
+|filterColumn|The column name from the iceberg table schema based on which filtering needs to happen|yes|
+|filterValue|The value to filter on|yes|
+
+`interval` Filter:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `interval`.|yes|
+|filterColumn|The column name from the iceberg table schema based on which filtering needs to happen|yes|
+|intervals|A JSON array containing ISO-8601 interval strings. This defines the time ranges to filter on. The start interval is inclusive and the end interval is exclusive. |yes|
+
+`and` Filter:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `and`.|yes|
+|filters|List of iceberg filters that needs to be AND-ed|yes|

Review Comment:
   ```suggestion
   |`filters`|List of Iceberg filters to include.|Yes|
   ```
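To make the composition concrete, here is a rough sketch of an `and` filter that combines the `interval` and `equals` filters described in the tables above. The column names and values are placeholders, not part of the PR.

```json
"icebergFilter": {
  "type": "and",
  "filters": [
    {
      "type": "interval",
      "filterColumn": "event_time",
      "intervals": ["2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"]
    },
    {
      "type": "equals",
      "filterColumn": "region",
      "filterValue": "us-west-2"
    }
  ]
}
```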



##########
docs/ingestion/input-sources.md:
##########
@@ -794,6 +794,194 @@ The following is an example of a Combining input source spec:
 ...
 ```
 
+## Iceberg input source
+
+> To use the Iceberg input source, load the extension `druid-iceberg-extensions`.
+
+The Iceberg input source reads data stored in the Iceberg table format. For a given table, this input source scans up to the latest Iceberg snapshot from the configured catalog and ingests the underlying live data files using the existing input source formats available in Druid.
+
+The Iceberg input source cannot be used on its own; it relies on the existing input sources to perform the actual reads from the data files.
+For example, if the warehouse associated with an Iceberg catalog is on S3, make sure the [`druid-s3-extensions`](../development/extensions-core/s3.md) extension is also loaded.
+
+Sample specs:
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "iceberg",
+        "tableName": "iceberg_table",
+        "namespace": "iceberg_namespace",
+        "icebergCatalog": {
+            "type": "hive",
+            "warehousePath": "hdfs://warehouse/path",
+            "catalogUri": "thrift://hive-metastore.x.com:8970",
+            "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "catalog_test",
+                "hadoop.security.authentication": "kerberos",
+                "hadoop.security.authorization": "true"
+            }
+        },
+        "icebergFilter": {
+            "type": "interval",
+            "filterColumn": "event_time",
+            "intervals": [
+              "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+            ]
+        },
+        "warehouseSource": {
+            "type": "hdfs"
+        }
+      },
+      "inputFormat": {
+        "type": "parquet"
+      }
+  },
+      ...
+},
+...
+```
+
+```json
+...
+        "ioConfig": {
+          "type": "index_parallel",
+          "inputSource": {
+            "type": "iceberg",
+            "tableName": "iceberg_table",
+            "namespace": "iceberg_namespace",
+            "icebergCatalog": {
+              "type": "hive",
+              "warehousePath": "hdfs://warehouse/path",
+              "catalogUri": "thrift://hive-metastore.x.com:8970",
+              "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "default_catalog",
+                "fs.s3a.access.key" : "S3_ACCESS_KEY",
+                "fs.s3a.secret.key" : "S3_SECRET_KEY",
+                "fs.s3a.endpoint" : "S3_API_ENDPOINT"
+              }
+            },
+            "icebergFilter": {
+              "type": "interval",
+              "filterColumn": "event_time",
+              "intervals": [
+                "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+              ]
+            },
+            "warehouseSource": {
+              "type": "s3",
+              "endpointConfig": {
+                "url": "teststore.aws.com",
+                "signingRegion": "us-west-2a"
+              },
+              "clientConfig": {
+                "protocol": "http",
+                "disableChunkedEncoding": true,
+                "enablePathStyleAccess": true,
+                "forceGlobalBucketAccessEnabled": false
+              },
+              "properties": {
+                "accessKeyId": {
+                  "type": "default",
+                  "password": "foo"
+                },
+                "secretAccessKey": {
+                  "type": "default",
+                  "password": "bar"
+                }
+              }
+            }
+          },
+          "inputFormat": {
+            "type": "parquet"
+          }
+        },
+...
+},
+```
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set the value to `iceberg`.|yes|
+|tableName|The iceberg table name configured in the catalog.|yes|
+|namespace|The iceberg namespace associated with the table|yes|
+|icebergFilter|JSON Object used to filter data files within a snapshot when reading|no|
+|icebergCatalog|JSON Object used to define the catalog that manages the configured iceberg table|yes|
+|warehouseSource|JSON Object used to indicate which native input source needs to be used to read the data files from the warehouse|yes|
+
+Catalog Object:
+
+There are two supported catalog types: `local` and `hive`
+
+Local catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `local`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+Hive Catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `hive`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogUri|The URI associated with the hive catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+IcebergFilter Object:
+
+This input source provides the following filters: `and`, `equals`, `interval`, and `or`. Use these filters to prune data files from a snapshot, reducing the number of files Druid has to ingest.
+
+`equals` Filter:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `equals`.|yes|
+|filterColumn|The column name from the iceberg table schema based on which filtering needs to happen|yes|
+|filterValue|The value to filter on|yes|
+
+`interval` Filter:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `interval`.|yes|
+|filterColumn|The column name from the iceberg table schema based on which filtering needs to happen|yes|
+|intervals|A JSON array containing ISO-8601 interval strings. This defines the time ranges to filter on. The start interval is inclusive and the end interval is exclusive. |yes|
+
+`and` Filter:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `and`.|yes|

Review Comment:
   ```suggestion
   |`type`|Set this value to `and`.|Yes|
   ```
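For completeness, the Local catalog table quoted above suggests a file-based catalog object along these lines; the warehouse path is a placeholder, and `catalogProperties` is omitted because it is optional:

```json
"icebergCatalog": {
  "type": "local",
  "warehousePath": "file:///tmp/iceberg_warehouse"
}
```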





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260535437


##########
docs/ingestion/input-sources.md:
##########
@@ -794,6 +794,194 @@ The following is an example of a Combining input source spec:
 ...
 ```
 
+## Iceberg input source
+
+> To use the Iceberg input source, load the extension `druid-iceberg-extensions`.
+
+The Iceberg input source reads data stored in the Iceberg table format. For a given table, this input source scans up to the latest Iceberg snapshot from the configured catalog and ingests the underlying live data files using the existing input source formats available in Druid.
+
+The Iceberg input source cannot be used on its own; it relies on the existing input sources to perform the actual reads from the data files.
+For example, if the warehouse associated with an Iceberg catalog is on S3, make sure the [`druid-s3-extensions`](../development/extensions-core/s3.md) extension is also loaded.
+
+Sample specs:
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "iceberg",
+        "tableName": "iceberg_table",
+        "namespace": "iceberg_namespace",
+        "icebergCatalog": {
+            "type": "hive",
+            "warehousePath": "hdfs://warehouse/path",
+            "catalogUri": "thrift://hive-metastore.x.com:8970",
+            "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "catalog_test",
+                "hadoop.security.authentication": "kerberos",
+                "hadoop.security.authorization": "true"
+            }
+        },
+        "icebergFilter": {
+            "type": "interval",
+            "filterColumn": "event_time",
+            "intervals": [
+              "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+            ]
+        },
+        "warehouseSource": {
+            "type": "hdfs"
+        }
+      },
+      "inputFormat": {
+        "type": "parquet"
+      }
+  },
+      ...
+},
+...
+```
+
+```json
+...
+        "ioConfig": {
+          "type": "index_parallel",
+          "inputSource": {
+            "type": "iceberg",
+            "tableName": "iceberg_table",
+            "namespace": "iceberg_namespace",
+            "icebergCatalog": {
+              "type": "hive",
+              "warehousePath": "hdfs://warehouse/path",
+              "catalogUri": "thrift://hive-metastore.x.com:8970",
+              "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "default_catalog",
+                "fs.s3a.access.key" : "S3_ACCESS_KEY",
+                "fs.s3a.secret.key" : "S3_SECRET_KEY",
+                "fs.s3a.endpoint" : "S3_API_ENDPOINT"
+              }
+            },
+            "icebergFilter": {
+              "type": "interval",
+              "filterColumn": "event_time",
+              "intervals": [
+                "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+              ]
+            },
+            "warehouseSource": {
+              "type": "s3",
+              "endpointConfig": {
+                "url": "teststore.aws.com",
+                "signingRegion": "us-west-2a"
+              },
+              "clientConfig": {
+                "protocol": "http",
+                "disableChunkedEncoding": true,
+                "enablePathStyleAccess": true,
+                "forceGlobalBucketAccessEnabled": false
+              },
+              "properties": {
+                "accessKeyId": {
+                  "type": "default",
+                  "password": "foo"
+                },
+                "secretAccessKey": {
+                  "type": "default",
+                  "password": "bar"
+                }
+              }
+            }
+          },
+          "inputFormat": {
+            "type": "parquet"
+          }
+        },
+...
+},
+```
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set the value to `iceberg`.|yes|
+|tableName|The iceberg table name configured in the catalog.|yes|
+|namespace|The iceberg namespace associated with the table|yes|
+|icebergFilter|JSON Object used to filter data files within a snapshot when reading|no|
+|icebergCatalog|JSON Object used to define the catalog that manages the configured iceberg table|yes|
+|warehouseSource|JSON Object used to indicate which native input source needs to be used to read the data files from the warehouse|yes|
+
+Catalog Object:
+
+There are two supported catalog types: `local` and `hive`
+
+Local catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `local`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+Hive Catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `hive`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogUri|The URI associated with the hive catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+IcebergFilter Object:
+
+This input source provides the following filters: `and`, `equals`, `interval`, and `or`. Use these filters to prune data files from a snapshot, reducing the number of files Druid has to ingest.
+
+`equals` Filter:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `equals`.|yes|
+|filterColumn|The column name from the iceberg table schema based on which filtering needs to happen|yes|
+|filterValue|The value to filter on|yes|
+
+`interval` Filter:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `interval`.|yes|
+|filterColumn|The column name from the iceberg table schema based on which filtering needs to happen|yes|
+|intervals|A JSON array containing ISO-8601 interval strings. This defines the time ranges to filter on. The start interval is inclusive and the end interval is exclusive. |yes|
+
+`and` Filter:

Review Comment:
   ```suggestion
   The `and` filter:
   ```
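The filter summary quoted above also lists an `or` filter, whose property table is not part of this hunk. Assuming it mirrors the `and` filter and takes a list of filters to be OR-ed (an assumption, not something the quoted text confirms), a sketch might look like:

```json
"icebergFilter": {
  "type": "or",
  "filters": [
    { "type": "equals", "filterColumn": "region", "filterValue": "us-west-2" },
    { "type": "equals", "filterColumn": "region", "filterValue": "us-east-1" }
  ]
}
```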





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260535905


##########
docs/ingestion/input-sources.md:
##########
@@ -794,6 +794,194 @@ The following is an example of a Combining input source spec:
 ...
 ```
 
+## Iceberg input source
+
+> To use the Iceberg input source, load the extension `druid-iceberg-extensions`.
+
+The Iceberg input source reads data stored in the Iceberg table format. For a given table, this input source scans up to the latest Iceberg snapshot from the configured catalog and ingests the underlying live data files using the existing input source formats available in Druid.
+
+The Iceberg input source cannot be used on its own; it relies on the existing input sources to perform the actual reads from the data files.
+For example, if the warehouse associated with an Iceberg catalog is on S3, make sure the [`druid-s3-extensions`](../development/extensions-core/s3.md) extension is also loaded.
+
+Sample specs:
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "iceberg",
+        "tableName": "iceberg_table",
+        "namespace": "iceberg_namespace",
+        "icebergCatalog": {
+            "type": "hive",
+            "warehousePath": "hdfs://warehouse/path",
+            "catalogUri": "thrift://hive-metastore.x.com:8970",
+            "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "catalog_test",
+                "hadoop.security.authentication": "kerberos",
+                "hadoop.security.authorization": "true"
+            }
+        },
+        "icebergFilter": {
+            "type": "interval",
+            "filterColumn": "event_time",
+            "intervals": [
+              "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+            ]
+        },
+        "warehouseSource": {
+            "type": "hdfs"
+        }
+      },
+      "inputFormat": {
+        "type": "parquet"
+      }
+  },
+      ...
+},
+...
+```
+
+```json
+...
+        "ioConfig": {
+          "type": "index_parallel",
+          "inputSource": {
+            "type": "iceberg",
+            "tableName": "iceberg_table",
+            "namespace": "iceberg_namespace",
+            "icebergCatalog": {
+              "type": "hive",
+              "warehousePath": "hdfs://warehouse/path",
+              "catalogUri": "thrift://hive-metastore.x.com:8970",
+              "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "default_catalog",
+                "fs.s3a.access.key" : "S3_ACCESS_KEY",
+                "fs.s3a.secret.key" : "S3_SECRET_KEY",
+                "fs.s3a.endpoint" : "S3_API_ENDPOINT"
+              }
+            },
+            "icebergFilter": {
+              "type": "interval",
+              "filterColumn": "event_time",
+              "intervals": [
+                "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+              ]
+            },
+            "warehouseSource": {
+              "type": "s3",
+              "endpointConfig": {
+                "url": "teststore.aws.com",
+                "signingRegion": "us-west-2a"
+              },
+              "clientConfig": {
+                "protocol": "http",
+                "disableChunkedEncoding": true,
+                "enablePathStyleAccess": true,
+                "forceGlobalBucketAccessEnabled": false
+              },
+              "properties": {
+                "accessKeyId": {
+                  "type": "default",
+                  "password": "foo"
+                },
+                "secretAccessKey": {
+                  "type": "default",
+                  "password": "bar"
+                }
+              }
+            }
+          },
+          "inputFormat": {
+            "type": "parquet"
+          }
+        },
+...
+},
+```
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set the value to `iceberg`.|yes|
+|tableName|The iceberg table name configured in the catalog.|yes|
+|namespace|The iceberg namespace associated with the table|yes|
+|icebergFilter|JSON Object used to filter data files within a snapshot when reading|no|
+|icebergCatalog|JSON Object used to define the catalog that manages the configured iceberg table|yes|
+|warehouseSource|JSON Object used to indicate which native input source needs to be used to read the data files from the warehouse|yes|
+
+Catalog Object:
+
+There are two supported catalog types: `local` and `hive`
+
+Local catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `local`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+Hive Catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `hive`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogUri|The URI associated with the hive catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+IcebergFilter Object:
+
+This input source provides the following filters: `and`, `equals`, `interval`, and `or`. Use these filters to prune data files from a snapshot, reducing the number of files Druid has to ingest.
+
+`equals` Filter:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `equals`.|yes|
+|filterColumn|The column name from the iceberg table schema based on which filtering needs to happen|yes|
+|filterValue|The value to filter on|yes|
+
+`interval` Filter:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `interval`.|yes|
+|filterColumn|The column name from the iceberg table schema based on which filtering needs to happen|yes|
+|intervals|A JSON array containing ISO-8601 interval strings. This defines the time ranges to filter on. The start interval is inclusive and the end interval is exclusive. |yes|
+
+`and` Filter:
+
+|Property|Description|Required|

Review Comment:
   ```suggestion
   |**Property**|**Description**|**Required**|
   ```





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260327088


##########
docs/development/extensions-contrib/iceberg.md:
##########
@@ -0,0 +1,121 @@
+---
+id: iceberg 
+title: "Iceberg"
+---
+
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+
+## Iceberg Ingest Extension
+
+This extension provides [IcebergInputSource](../../ingestion/input-sources.md#iceberg-input-source), which lets you ingest data stored in the Iceberg table format into Druid.
+
+Apache Iceberg is an open table format for huge analytic datasets. Although Iceberg keeps most of its metadata in metadata files in object storage, it still depends on a metastore to manage a certain amount of metadata.
+These metastores are defined as Iceberg catalogs and this extension supports connecting to the following catalog types:

Review Comment:
   ```suggestion
   Iceberg refers to these metastores as catalogs. The Iceberg extension lets you connect to the following Iceberg catalog types:
   ```





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260529222


##########
docs/ingestion/input-sources.md:
##########
@@ -794,6 +794,194 @@ The following is an example of a Combining input source spec:
 ...
 ```
 
+## Iceberg input source
+
+> To use the Iceberg input source, load the extension `druid-iceberg-extensions`.
+
+The Iceberg input source reads data stored in the Iceberg table format. For a given table, this input source scans up to the latest Iceberg snapshot from the configured catalog and ingests the underlying live data files using the existing input source formats available in Druid.
+
+The Iceberg input source cannot be used on its own; it relies on the existing input sources to perform the actual reads from the data files.
+For example, if the warehouse associated with an Iceberg catalog is on S3, make sure the [`druid-s3-extensions`](../development/extensions-core/s3.md) extension is also loaded.
+
+Sample specs:
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "iceberg",
+        "tableName": "iceberg_table",
+        "namespace": "iceberg_namespace",
+        "icebergCatalog": {
+            "type": "hive",
+            "warehousePath": "hdfs://warehouse/path",
+            "catalogUri": "thrift://hive-metastore.x.com:8970",
+            "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "catalog_test",
+                "hadoop.security.authentication": "kerberos",
+                "hadoop.security.authorization": "true"
+            }
+        },
+        "icebergFilter": {
+            "type": "interval",
+            "filterColumn": "event_time",
+            "intervals": [
+              "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+            ]
+        },
+        "warehouseSource": {
+            "type": "hdfs"
+        }
+      },
+      "inputFormat": {
+        "type": "parquet"
+      }
+  },
+      ...
+},
+...
+```
+
+```json
+...
+        "ioConfig": {
+          "type": "index_parallel",
+          "inputSource": {
+            "type": "iceberg",
+            "tableName": "iceberg_table",
+            "namespace": "iceberg_namespace",
+            "icebergCatalog": {
+              "type": "hive",
+              "warehousePath": "hdfs://warehouse/path",
+              "catalogUri": "thrift://hive-metastore.x.com:8970",
+              "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "default_catalog",
+                "fs.s3a.access.key" : "S3_ACCESS_KEY",
+                "fs.s3a.secret.key" : "S3_SECRET_KEY",
+                "fs.s3a.endpoint" : "S3_API_ENDPOINT"
+              }
+            },
+            "icebergFilter": {
+              "type": "interval",
+              "filterColumn": "event_time",
+              "intervals": [
+                "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+              ]
+            },
+            "warehouseSource": {
+              "type": "s3",
+              "endpointConfig": {
+                "url": "teststore.aws.com",
+                "signingRegion": "us-west-2a"
+              },
+              "clientConfig": {
+                "protocol": "http",
+                "disableChunkedEncoding": true,
+                "enablePathStyleAccess": true,
+                "forceGlobalBucketAccessEnabled": false
+              },
+              "properties": {
+                "accessKeyId": {
+                  "type": "default",
+                  "password": "foo"
+                },
+                "secretAccessKey": {
+                  "type": "default",
+                  "password": "bar"
+                }
+              }
+            }
+          },
+          "inputFormat": {
+            "type": "parquet"
+          }
+        },
+...
+},
+```
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set the value to `iceberg`.|yes|
+|tableName|The iceberg table name configured in the catalog.|yes|
+|namespace|The iceberg namespace associated with the table|yes|
+|icebergFilter|JSON Object used to filter data files within a snapshot when reading|no|
+|icebergCatalog|JSON Object used to define the catalog that manages the configured iceberg table|yes|
+|warehouseSource|JSON Object used to indicate which native input source needs to be used to read the data files from the warehouse|yes|
+
+Catalog Object:
+
+There are two supported catalog types: `local` and `hive`
+
+Local catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `local`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+Hive Catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `hive`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogUri|The URI associated with the hive catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+IcebergFilter Object:
+
+This input source provides the following filters: `and`, `equals`, `interval`, and `or`. Use these filters to prune data files from a snapshot, reducing the number of files Druid has to ingest.
+
+`equals` Filter:
+
+|Property|Description|Required|

Review Comment:
   ```suggestion
   |**Property**|**Description**|**Required**|
   ```



##########
docs/ingestion/input-sources.md:
##########
@@ -794,6 +794,194 @@ The following is an example of a Combining input source spec:
 ...
 ```
 
+## Iceberg input source
+
+> To use the Iceberg input source, load the extension `druid-iceberg-extensions`.
+
+The Iceberg input source reads data stored in the Iceberg table format. For a given table, this input source scans up to the latest Iceberg snapshot from the configured catalog and ingests the underlying live data files using the existing input source formats available in Druid.
+
+The Iceberg input source cannot be used on its own; it relies on the existing input sources to perform the actual reads from the data files.
+For example, if the warehouse associated with an Iceberg catalog is on S3, make sure the [`druid-s3-extensions`](../development/extensions-core/s3.md) extension is also loaded.
+
+Sample specs:
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "iceberg",
+        "tableName": "iceberg_table",
+        "namespace": "iceberg_namespace",
+        "icebergCatalog": {
+            "type": "hive",
+            "warehousePath": "hdfs://warehouse/path",
+            "catalogUri": "thrift://hive-metastore.x.com:8970",
+            "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "catalog_test",
+                "hadoop.security.authentication": "kerberos",
+                "hadoop.security.authorization": "true"
+            }
+        },
+        "icebergFilter": {
+            "type": "interval",
+            "filterColumn": "event_time",
+            "intervals": [
+              "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+            ]
+        },
+        "warehouseSource": {
+            "type": "hdfs"
+        }
+      },
+      "inputFormat": {
+        "type": "parquet"
+      }
+  },
+      ...
+},
+...
+```
+
+```json
+...
+        "ioConfig": {
+          "type": "index_parallel",
+          "inputSource": {
+            "type": "iceberg",
+            "tableName": "iceberg_table",
+            "namespace": "iceberg_namespace",
+            "icebergCatalog": {
+              "type": "hive",
+              "warehousePath": "hdfs://warehouse/path",
+              "catalogUri": "thrift://hive-metastore.x.com:8970",
+              "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "default_catalog",
+                "fs.s3a.access.key" : "S3_ACCESS_KEY",
+                "fs.s3a.secret.key" : "S3_SECRET_KEY",
+                "fs.s3a.endpoint" : "S3_API_ENDPOINT"
+              }
+            },
+            "icebergFilter": {
+              "type": "interval",
+              "filterColumn": "event_time",
+              "intervals": [
+                "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+              ]
+            },
+            "warehouseSource": {
+              "type": "s3",
+              "endpointConfig": {
+                "url": "teststore.aws.com",
+                "signingRegion": "us-west-2a"
+              },
+              "clientConfig": {
+                "protocol": "http",
+                "disableChunkedEncoding": true,
+                "enablePathStyleAccess": true,
+                "forceGlobalBucketAccessEnabled": false
+              },
+              "properties": {
+                "accessKeyId": {
+                  "type": "default",
+                  "password": "foo"
+                },
+                "secretAccessKey": {
+                  "type": "default",
+                  "password": "bar"
+                }
+              }
+            }
+          },
+          "inputFormat": {
+            "type": "parquet"
+          }
+        },
+...
+},
+```
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set the value to `iceberg`.|yes|
+|tableName|The iceberg table name configured in the catalog.|yes|
+|namespace|The iceberg namespace associated with the table|yes|
+|icebergFilter|JSON Object used to filter data files within a snapshot when reading|no|
+|icebergCatalog|JSON Object used to define the catalog that manages the configured iceberg table|yes|
+|warehouseSource|JSON Object used to indicate which native input source needs to be used to read the data files from the warehouse|yes|
+
+Catalog Object:
+
+There are two supported catalog types: `local` and `hive`
+
+Local catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `local`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+Hive Catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `hive`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogUri|The URI associated with the hive catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+IcebergFilter Object:
+
+This input source provides the following filters: `and`, `equals`, `interval`, and `or`. Use these filters to prune data files from a snapshot, reducing the number of files Druid has to ingest.
+
+`equals` Filter:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `equals`.|yes|

Review Comment:
   ```suggestion
   |`type`|Set this value to `equals`.|Yes|
   ```





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260531282


##########
docs/ingestion/input-sources.md:
##########
@@ -794,6 +794,194 @@ The following is an example of a Combining input source spec:
 ...
 ```
 
+## Iceberg input source
+
+> To use the Iceberg input source, load the extension `druid-iceberg-extensions`.
+
+The Iceberg input source reads data stored in the Iceberg table format. For a given table, this input source scans up to the latest Iceberg snapshot from the configured catalog and ingests the underlying live data files using the existing input source formats available in Druid.
+
+The Iceberg input source cannot be used on its own; it relies on the existing input sources to perform the actual reads from the data files.
+For example, if the warehouse associated with an Iceberg catalog is on S3, make sure the [`druid-s3-extensions`](../development/extensions-core/s3.md) extension is also loaded.
+
+Sample specs:
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "iceberg",
+        "tableName": "iceberg_table",
+        "namespace": "iceberg_namespace",
+        "icebergCatalog": {
+            "type": "hive",
+            "warehousePath": "hdfs://warehouse/path",
+            "catalogUri": "thrift://hive-metastore.x.com:8970",
+            "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "catalog_test",
+                "hadoop.security.authentication": "kerberos",
+                "hadoop.security.authorization": "true"
+            }
+        },
+        "icebergFilter": {
+            "type": "interval",
+            "filterColumn": "event_time",
+            "intervals": [
+              "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+            ]
+        },
+        "warehouseSource": {
+            "type": "hdfs"
+        }
+      },
+      "inputFormat": {
+        "type": "parquet"
+      }
+  },
+      ...
+},
+...
+```
+
+```json
+...
+        "ioConfig": {
+          "type": "index_parallel",
+          "inputSource": {
+            "type": "iceberg",
+            "tableName": "iceberg_table",
+            "namespace": "iceberg_namespace",
+            "icebergCatalog": {
+              "type": "hive",
+              "warehousePath": "hdfs://warehouse/path",
+              "catalogUri": "thrift://hive-metastore.x.com:8970",
+              "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "default_catalog",
+                "fs.s3a.access.key" : "S3_ACCESS_KEY",
+                "fs.s3a.secret.key" : "S3_SECRET_KEY",
+                "fs.s3a.endpoint" : "S3_API_ENDPOINT"
+              }
+            },
+            "icebergFilter": {
+              "type": "interval",
+              "filterColumn": "event_time",
+              "intervals": [
+                "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+              ]
+            },
+            "warehouseSource": {
+              "type": "s3",
+              "endpointConfig": {
+                "url": "teststore.aws.com",
+                "signingRegion": "us-west-2a"
+              },
+              "clientConfig": {
+                "protocol": "http",
+                "disableChunkedEncoding": true,
+                "enablePathStyleAccess": true,
+                "forceGlobalBucketAccessEnabled": false
+              },
+              "properties": {
+                "accessKeyId": {
+                  "type": "default",
+                  "password": "foo"
+                },
+                "secretAccessKey": {
+                  "type": "default",
+                  "password": "bar"
+                }
+              }
+            }
+          },
+          "inputFormat": {
+            "type": "parquet"
+          }
+        },
+...
+},
+```
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set the value to `iceberg`.|yes|
+|tableName|The iceberg table name configured in the catalog.|yes|
+|namespace|The iceberg namespace associated with the table|yes|
+|icebergFilter|JSON Object used to filter data files within a snapshot when reading|no|
+|icebergCatalog|JSON Object used to define the catalog that manages the configured iceberg table|yes|
+|warehouseSource|JSON Object used to indicate which native input source needs to be used to read the data files from the warehouse|yes|
+
+Catalog Object:
+
+There are two supported catalog types: `local` and `hive`
+
+Local catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `local`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+Hive Catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `hive`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogUri|The URI associated with the hive catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+IcebergFilter Object:
+
+This input source provides the following filters: `and`, `equals`, `interval`, and `or`. Use these filters to prune data files from a snapshot, reducing the number of files Druid has to ingest.
+
+`equals` Filter:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `equals`.|yes|
+|filterColumn|The column name from the iceberg table schema based on which filtering needs to happen|yes|
+|filterValue|The value to filter on|yes|

Review Comment:
   ```suggestion
   |`filterValue`|The value to filter on.|Yes|
   ```





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260526488


##########
docs/ingestion/input-sources.md:
##########
@@ -794,6 +794,194 @@ The following is an example of a Combining input source spec:
 ...
 ```
 
+## Iceberg input source
+
+> To use the Iceberg input source, load the extension `druid-iceberg-extensions`.
+
+The Iceberg input source reads data stored in the Iceberg table format. For a given table, this input source scans up to the latest Iceberg snapshot from the configured catalog and ingests the underlying live data files using the existing input source formats available in Druid.
+
+The Iceberg input source cannot be used on its own; it relies on the existing input sources to perform the actual reads from the data files.
+For example, if the warehouse associated with an Iceberg catalog is on S3, make sure the [`druid-s3-extensions`](../development/extensions-core/s3.md) extension is also loaded.
+
+Sample specs:
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "iceberg",
+        "tableName": "iceberg_table",
+        "namespace": "iceberg_namespace",
+        "icebergCatalog": {
+            "type": "hive",
+            "warehousePath": "hdfs://warehouse/path",
+            "catalogUri": "thrift://hive-metastore.x.com:8970",
+            "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "catalog_test",
+                "hadoop.security.authentication": "kerberos",
+                "hadoop.security.authorization": "true"
+            }
+        },
+        "icebergFilter": {
+            "type": "interval",
+            "filterColumn": "event_time",
+            "intervals": [
+              "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+            ]
+        },
+        "warehouseSource": {
+            "type": "hdfs"
+        }
+      },
+      "inputFormat": {
+        "type": "parquet"
+      }
+  },
+      ...
+},
+...
+```
+
+```json
+...
+        "ioConfig": {
+          "type": "index_parallel",
+          "inputSource": {
+            "type": "iceberg",
+            "tableName": "iceberg_table",
+            "namespace": "iceberg_namespace",
+            "icebergCatalog": {
+              "type": "hive",
+              "warehousePath": "hdfs://warehouse/path",
+              "catalogUri": "thrift://hive-metastore.x.com:8970",
+              "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "default_catalog",
+                "fs.s3a.access.key" : "S3_ACCESS_KEY",
+                "fs.s3a.secret.key" : "S3_SECRET_KEY",
+                "fs.s3a.endpoint" : "S3_API_ENDPOINT"
+              }
+            },
+            "icebergFilter": {
+              "type": "interval",
+              "filterColumn": "event_time",
+              "intervals": [
+                "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+              ]
+            },
+            "warehouseSource": {
+              "type": "s3",
+              "endpointConfig": {
+                "url": "teststore.aws.com",
+                "signingRegion": "us-west-2a"
+              },
+              "clientConfig": {
+                "protocol": "http",
+                "disableChunkedEncoding": true,
+                "enablePathStyleAccess": true,
+                "forceGlobalBucketAccessEnabled": false
+              },
+              "properties": {
+                "accessKeyId": {
+                  "type": "default",
+                  "password": "foo"
+                },
+                "secretAccessKey": {
+                  "type": "default",
+                  "password": "bar"
+                }
+              }
+            }
+          },
+          "inputFormat": {
+            "type": "parquet"
+          }
+        },
+...
+},
+```
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set the value to `iceberg`.|yes|
+|tableName|The iceberg table name configured in the catalog.|yes|
+|namespace|The iceberg namespace associated with the table|yes|
+|icebergFilter|JSON Object used to filter data files within a snapshot when reading|no|
+|icebergCatalog|JSON Object used to define the catalog that manages the configured iceberg table|yes|
+|warehouseSource|JSON Object used to indicate which native input source needs to be used to read the data files from the warehouse|yes|
+
+Catalog Object:
+
+There are two supported catalog types: `local` and `hive`
+
+Local catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `local`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+Hive Catalog:
+
+|Property|Description|Required|

Review Comment:
   ```suggestion
   |**Property**|**Description**|**Required**|
   ```





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260346590


##########
docs/development/extensions-contrib/iceberg.md:
##########
@@ -0,0 +1,121 @@
+---
+id: iceberg 
+title: "Iceberg"
+---
+
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+
+## Iceberg Ingest Extension
+
+This extension provides [IcebergInputSource](../../ingestion/input-sources.md#iceberg-input-source) which enables ingestion of data stored in the Iceberg table format into Druid.
+
+Apache Iceberg is an open table format for huge analytic datasets. Even though iceberg manages most of its metadata in metadata files in the object storage, it is still dependent on a metastore for managing a certain amount of metadata.
+These metastores are defined as Iceberg catalogs and this extension supports connecting to the following catalog types:
+* Hive metastore catalog
+* Local catalog
+
+Support for AWS Glue and REST based catalogs are not available yet.
+
+For a given catalog, iceberg table name and filters, the IcebergInputSource works by reading the table from the catalog, applying the filters and extracting all the underlying live data files up to the latest snapshot.

Review Comment:
   ```suggestion
   For a given catalog, Iceberg input source reads the table name from the catalog, applies the filters, and extracts all the underlying live data files up to the latest snapshot.
   ```





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260504487


##########
docs/ingestion/input-sources.md:
##########
@@ -794,6 +794,194 @@ The following is an example of a Combining input source spec:
 ...
 ```
 
+## Iceberg input source
+
+> You need to include the `druid-iceberg-extensions` as an extension to use the Iceberg input source.
+
+The Iceberg input source is used to read data stored in the Iceberg table format. For a given table, this input source scans up to the latest iceberg snapshot from the configured Hive catalog and the underlying live data files will be ingested using the existing input source formats available in Druid.
+
+The Iceberg input source cannot be independent as it relies on the existing input sources to perform the actual read from the Data files.
+For example, if the warehouse associated with an iceberg catalog is on `S3`, please ensure that the [`druid-s3-extensions`](../development/extensions-core/s3.md) extension is also loaded.
+
+Sample specs:
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "iceberg",
+        "tableName": "iceberg_table",
+        "namespace": "iceberg_namespace",
+        "icebergCatalog": {
+            "type": "hive",
+            "warehousePath": "hdfs://warehouse/path",
+            "catalogUri": "thrift://hive-metastore.x.com:8970",
+            "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "catalog_test",
+                "hadoop.security.authentication": "kerberos",
+                "hadoop.security.authorization": "true"
+            }
+        },
+        "icebergFilter": {
+            "type": "interval",
+            "filterColumn": "event_time",
+            "intervals": [
+              "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+            ]
+        },
+        "warehouseSource": {
+            "type": "hdfs"
+        }
+      },
+      "inputFormat": {
+        "type": "parquet"
+      }
+  },
+      ...
+},
+...
+```
+
+```json
+...
+        "ioConfig": {
+          "type": "index_parallel",
+          "inputSource": {
+            "type": "iceberg",
+            "tableName": "iceberg_table",
+            "namespace": "iceberg_namespace",
+            "icebergCatalog": {
+              "type": "hive",
+              "warehousePath": "hdfs://warehouse/path",
+              "catalogUri": "thrift://hive-metastore.x.com:8970",
+              "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "default_catalog",
+                "fs.s3a.access.key" : "S3_ACCESS_KEY",
+                "fs.s3a.secret.key" : "S3_SECRET_KEY",
+                "fs.s3a.endpoint" : "S3_API_ENDPOINT"
+              }
+            },
+            "icebergFilter": {
+              "type": "interval",
+              "filterColumn": "event_time",
+              "intervals": [
+                "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+              ]
+            },
+            "warehouseSource": {
+              "type": "s3",
+              "endpointConfig": {
+                "url": "teststore.aws.com",
+                "signingRegion": "us-west-2a"
+              },
+              "clientConfig": {
+                "protocol": "http",
+                "disableChunkedEncoding": true,
+                "enablePathStyleAccess": true,
+                "forceGlobalBucketAccessEnabled": false
+              },
+              "properties": {
+                "accessKeyId": {
+                  "type": "default",
+                  "password": "foo"
+                },
+                "secretAccessKey": {
+                  "type": "default",
+                  "password": "bar"
+                }
+              }
+            }
+          },
+          "inputFormat": {
+            "type": "parquet"
+          }
+        },
+...
+},
+```
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set the value to `iceberg`.|yes|
+|tableName|The iceberg table name configured in the catalog.|yes|
+|namespace|The iceberg namespace associated with the table|yes|
+|icebergFilter|JSON Object used to filter data files within a snapshot when reading|no|

Review Comment:
   ```suggestion
   |`icebergFilter`|The JSON object that filters data files within a snapshot.|No|
   ```





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "abhishekagarwal87 (via GitHub)" <gi...@apache.org>.
abhishekagarwal87 commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1264485485


##########
distribution/pom.xml:
##########
@@ -258,6 +258,8 @@
                                         <argument>-c</argument>
                                         <argument>org.apache.druid.extensions:druid-kubernetes-extensions</argument>
                                         <argument>-c</argument>
+                                        <argument>org.apache.druid.extensions:druid-iceberg-extensions</argument>

Review Comment:
   this is a contrib extension so we shouldn't be shipping it in the distribution bundle. 





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260534502


##########
docs/ingestion/input-sources.md:
##########
@@ -794,6 +794,194 @@ The following is an example of a Combining input source spec:
 ...
 ```
 
+## Iceberg input source
+
+> You need to include the `druid-iceberg-extensions` as an extension to use the Iceberg input source.
+
+The Iceberg input source is used to read data stored in the Iceberg table format. For a given table, this input source scans up to the latest iceberg snapshot from the configured Hive catalog and the underlying live data files will be ingested using the existing input source formats available in Druid.
+
+The Iceberg input source cannot be independent as it relies on the existing input sources to perform the actual read from the Data files.
+For example, if the warehouse associated with an iceberg catalog is on `S3`, please ensure that the [`druid-s3-extensions`](../development/extensions-core/s3.md) extension is also loaded.
+
+Sample specs:
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "iceberg",
+        "tableName": "iceberg_table",
+        "namespace": "iceberg_namespace",
+        "icebergCatalog": {
+            "type": "hive",
+            "warehousePath": "hdfs://warehouse/path",
+            "catalogUri": "thrift://hive-metastore.x.com:8970",
+            "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "catalog_test",
+                "hadoop.security.authentication": "kerberos",
+                "hadoop.security.authorization": "true"
+            }
+        },
+        "icebergFilter": {
+            "type": "interval",
+            "filterColumn": "event_time",
+            "intervals": [
+              "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+            ]
+        },
+        "warehouseSource": {
+            "type": "hdfs"
+        }
+      },
+      "inputFormat": {
+        "type": "parquet"
+      }
+  },
+      ...
+},
+...
+```
+
+```json
+...
+        "ioConfig": {
+          "type": "index_parallel",
+          "inputSource": {
+            "type": "iceberg",
+            "tableName": "iceberg_table",
+            "namespace": "iceberg_namespace",
+            "icebergCatalog": {
+              "type": "hive",
+              "warehousePath": "hdfs://warehouse/path",
+              "catalogUri": "thrift://hive-metastore.x.com:8970",
+              "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "default_catalog",
+                "fs.s3a.access.key" : "S3_ACCESS_KEY",
+                "fs.s3a.secret.key" : "S3_SECRET_KEY",
+                "fs.s3a.endpoint" : "S3_API_ENDPOINT"
+              }
+            },
+            "icebergFilter": {
+              "type": "interval",
+              "filterColumn": "event_time",
+              "intervals": [
+                "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+              ]
+            },
+            "warehouseSource": {
+              "type": "s3",
+              "endpointConfig": {
+                "url": "teststore.aws.com",
+                "signingRegion": "us-west-2a"
+              },
+              "clientConfig": {
+                "protocol": "http",
+                "disableChunkedEncoding": true,
+                "enablePathStyleAccess": true,
+                "forceGlobalBucketAccessEnabled": false
+              },
+              "properties": {
+                "accessKeyId": {
+                  "type": "default",
+                  "password": "foo"
+                },
+                "secretAccessKey": {
+                  "type": "default",
+                  "password": "bar"
+                }
+              }
+            }
+          },
+          "inputFormat": {
+            "type": "parquet"
+          }
+        },
+...
+},
+```
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set the value to `iceberg`.|yes|
+|tableName|The iceberg table name configured in the catalog.|yes|
+|namespace|The iceberg namespace associated with the table|yes|
+|icebergFilter|JSON Object used to filter data files within a snapshot when reading|no|
+|icebergCatalog|JSON Object used to define the catalog that manages the configured iceberg table|yes|
+|warehouseSource|JSON Object used to indicate which native input source needs to be used to read the data files from the warehouse|yes|
+
+Catalog Object:
+
+There are two supported catalog types: `local` and `hive`
+
+Local catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `local`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+Hive Catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `hive`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogUri|The URI associated with the hive catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+IcebergFilter Object:
+
+This input source provides filters: `and` , `equals` , `interval` and `or`. These filters can be used to filter out data files from a snapshot, thereby reducing the number of files Druid has to ingest.
+
+`equals` Filter:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `equals`.|yes|
+|filterColumn|The column name from the iceberg table schema based on which filtering needs to happen|yes|
+|filterValue|The value to filter on|yes|
+
+`interval` Filter:
+
+|Property|Description|Required|

Review Comment:
   ```suggestion
   |**Property**|**Description**|**Required**|
   ```





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260527678


##########
docs/ingestion/input-sources.md:
##########
@@ -794,6 +794,194 @@ The following is an example of a Combining input source spec:
 ...
 ```
 
+## Iceberg input source
+
+> You need to include the `druid-iceberg-extensions` as an extension to use the Iceberg input source.
+
+The Iceberg input source is used to read data stored in the Iceberg table format. For a given table, this input source scans up to the latest iceberg snapshot from the configured Hive catalog and the underlying live data files will be ingested using the existing input source formats available in Druid.
+
+The Iceberg input source cannot be independent as it relies on the existing input sources to perform the actual read from the Data files.
+For example, if the warehouse associated with an iceberg catalog is on `S3`, please ensure that the [`druid-s3-extensions`](../development/extensions-core/s3.md) extension is also loaded.
+
+Sample specs:
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "iceberg",
+        "tableName": "iceberg_table",
+        "namespace": "iceberg_namespace",
+        "icebergCatalog": {
+            "type": "hive",
+            "warehousePath": "hdfs://warehouse/path",
+            "catalogUri": "thrift://hive-metastore.x.com:8970",
+            "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "catalog_test",
+                "hadoop.security.authentication": "kerberos",
+                "hadoop.security.authorization": "true"
+            }
+        },
+        "icebergFilter": {
+            "type": "interval",
+            "filterColumn": "event_time",
+            "intervals": [
+              "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+            ]
+        },
+        "warehouseSource": {
+            "type": "hdfs"
+        }
+      },
+      "inputFormat": {
+        "type": "parquet"
+      }
+  },
+      ...
+},
+...
+```
+
+```json
+...
+        "ioConfig": {
+          "type": "index_parallel",
+          "inputSource": {
+            "type": "iceberg",
+            "tableName": "iceberg_table",
+            "namespace": "iceberg_namespace",
+            "icebergCatalog": {
+              "type": "hive",
+              "warehousePath": "hdfs://warehouse/path",
+              "catalogUri": "thrift://hive-metastore.x.com:8970",
+              "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "default_catalog",
+                "fs.s3a.access.key" : "S3_ACCESS_KEY",
+                "fs.s3a.secret.key" : "S3_SECRET_KEY",
+                "fs.s3a.endpoint" : "S3_API_ENDPOINT"
+              }
+            },
+            "icebergFilter": {
+              "type": "interval",
+              "filterColumn": "event_time",
+              "intervals": [
+                "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+              ]
+            },
+            "warehouseSource": {
+              "type": "s3",
+              "endpointConfig": {
+                "url": "teststore.aws.com",
+                "signingRegion": "us-west-2a"
+              },
+              "clientConfig": {
+                "protocol": "http",
+                "disableChunkedEncoding": true,
+                "enablePathStyleAccess": true,
+                "forceGlobalBucketAccessEnabled": false
+              },
+              "properties": {
+                "accessKeyId": {
+                  "type": "default",
+                  "password": "foo"
+                },
+                "secretAccessKey": {
+                  "type": "default",
+                  "password": "bar"
+                }
+              }
+            }
+          },
+          "inputFormat": {
+            "type": "parquet"
+          }
+        },
+...
+},
+```
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set the value to `iceberg`.|yes|
+|tableName|The iceberg table name configured in the catalog.|yes|
+|namespace|The iceberg namespace associated with the table|yes|
+|icebergFilter|JSON Object used to filter data files within a snapshot when reading|no|
+|icebergCatalog|JSON Object used to define the catalog that manages the configured iceberg table|yes|
+|warehouseSource|JSON Object used to indicate which native input source needs to be used to read the data files from the warehouse|yes|
+
+Catalog Object:
+
+There are two supported catalog types: `local` and `hive`
+
+Local catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `local`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+Hive Catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `hive`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogUri|The URI associated with the hive catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+IcebergFilter Object:

Review Comment:
   ```suggestion
   ### IcebergFilter object
   ```





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260535057


##########
docs/ingestion/input-sources.md:
##########
@@ -794,6 +794,194 @@ The following is an example of a Combining input source spec:
 ...
 ```
 
+## Iceberg input source
+
+> You need to include the `druid-iceberg-extensions` as an extension to use the Iceberg input source.
+
+The Iceberg input source is used to read data stored in the Iceberg table format. For a given table, this input source scans up to the latest iceberg snapshot from the configured Hive catalog and the underlying live data files will be ingested using the existing input source formats available in Druid.
+
+The Iceberg input source cannot be independent as it relies on the existing input sources to perform the actual read from the Data files.
+For example, if the warehouse associated with an iceberg catalog is on `S3`, please ensure that the [`druid-s3-extensions`](../development/extensions-core/s3.md) extension is also loaded.
+
+Sample specs:
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "iceberg",
+        "tableName": "iceberg_table",
+        "namespace": "iceberg_namespace",
+        "icebergCatalog": {
+            "type": "hive",
+            "warehousePath": "hdfs://warehouse/path",
+            "catalogUri": "thrift://hive-metastore.x.com:8970",
+            "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "catalog_test",
+                "hadoop.security.authentication": "kerberos",
+                "hadoop.security.authorization": "true"
+            }
+        },
+        "icebergFilter": {
+            "type": "interval",
+            "filterColumn": "event_time",
+            "intervals": [
+              "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+            ]
+        },
+        "warehouseSource": {
+            "type": "hdfs"
+        }
+      },
+      "inputFormat": {
+        "type": "parquet"
+      }
+  },
+      ...
+},
+...
+```
+
+```json
+...
+        "ioConfig": {
+          "type": "index_parallel",
+          "inputSource": {
+            "type": "iceberg",
+            "tableName": "iceberg_table",
+            "namespace": "iceberg_namespace",
+            "icebergCatalog": {
+              "type": "hive",
+              "warehousePath": "hdfs://warehouse/path",
+              "catalogUri": "thrift://hive-metastore.x.com:8970",
+              "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "default_catalog",
+                "fs.s3a.access.key" : "S3_ACCESS_KEY",
+                "fs.s3a.secret.key" : "S3_SECRET_KEY",
+                "fs.s3a.endpoint" : "S3_API_ENDPOINT"
+              }
+            },
+            "icebergFilter": {
+              "type": "interval",
+              "filterColumn": "event_time",
+              "intervals": [
+                "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+              ]
+            },
+            "warehouseSource": {
+              "type": "s3",
+              "endpointConfig": {
+                "url": "teststore.aws.com",
+                "signingRegion": "us-west-2a"
+              },
+              "clientConfig": {
+                "protocol": "http",
+                "disableChunkedEncoding": true,
+                "enablePathStyleAccess": true,
+                "forceGlobalBucketAccessEnabled": false
+              },
+              "properties": {
+                "accessKeyId": {
+                  "type": "default",
+                  "password": "foo"
+                },
+                "secretAccessKey": {
+                  "type": "default",
+                  "password": "bar"
+                }
+              }
+            }
+          },
+          "inputFormat": {
+            "type": "parquet"
+          }
+        },
+...
+},
+```
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set the value to `iceberg`.|yes|
+|tableName|The iceberg table name configured in the catalog.|yes|
+|namespace|The iceberg namespace associated with the table|yes|
+|icebergFilter|JSON Object used to filter data files within a snapshot when reading|no|
+|icebergCatalog|JSON Object used to define the catalog that manages the configured iceberg table|yes|
+|warehouseSource|JSON Object used to indicate which native input source needs to be used to read the data files from the warehouse|yes|
+
+Catalog Object:
+
+There are two supported catalog types: `local` and `hive`
+
+Local catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `local`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+Hive Catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `hive`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogUri|The URI associated with the hive catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+IcebergFilter Object:
+
+This input source provides filters: `and` , `equals` , `interval` and `or`. These filters can be used to filter out data files from a snapshot, thereby reducing the number of files Druid has to ingest.
+
+`equals` Filter:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `equals`.|yes|
+|filterColumn|The column name from the iceberg table schema based on which filtering needs to happen|yes|
+|filterValue|The value to filter on|yes|
+
+`interval` Filter:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `interval`.|yes|
+|filterColumn|The column name from the iceberg table schema based on which filtering needs to happen|yes|

Review Comment:
   ```suggestion
   |filterColumn|The name of the column from the Iceberg table schema to use for filtering.|Yes|
   ```



##########
docs/ingestion/input-sources.md:
##########
@@ -794,6 +794,194 @@ The following is an example of a Combining input source spec:
 ...
 ```
 
+## Iceberg input source
+
+> You need to include the `druid-iceberg-extensions` as an extension to use the Iceberg input source.
+
+The Iceberg input source is used to read data stored in the Iceberg table format. For a given table, this input source scans up to the latest iceberg snapshot from the configured Hive catalog and the underlying live data files will be ingested using the existing input source formats available in Druid.
+
+The Iceberg input source cannot be independent as it relies on the existing input sources to perform the actual read from the Data files.
+For example, if the warehouse associated with an iceberg catalog is on `S3`, please ensure that the [`druid-s3-extensions`](../development/extensions-core/s3.md) extension is also loaded.
+
+Sample specs:
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "iceberg",
+        "tableName": "iceberg_table",
+        "namespace": "iceberg_namespace",
+        "icebergCatalog": {
+            "type": "hive",
+            "warehousePath": "hdfs://warehouse/path",
+            "catalogUri": "thrift://hive-metastore.x.com:8970",
+            "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "catalog_test",
+                "hadoop.security.authentication": "kerberos",
+                "hadoop.security.authorization": "true"
+            }
+        },
+        "icebergFilter": {
+            "type": "interval",
+            "filterColumn": "event_time",
+            "intervals": [
+              "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+            ]
+        },
+        "warehouseSource": {
+            "type": "hdfs"
+        }
+      },
+      "inputFormat": {
+        "type": "parquet"
+      }
+  },
+      ...
+},
+...
+```
+
+```json
+...
+        "ioConfig": {
+          "type": "index_parallel",
+          "inputSource": {
+            "type": "iceberg",
+            "tableName": "iceberg_table",
+            "namespace": "iceberg_namespace",
+            "icebergCatalog": {
+              "type": "hive",
+              "warehousePath": "hdfs://warehouse/path",
+              "catalogUri": "thrift://hive-metastore.x.com:8970",
+              "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "default_catalog",
+                "fs.s3a.access.key" : "S3_ACCESS_KEY",
+                "fs.s3a.secret.key" : "S3_SECRET_KEY",
+                "fs.s3a.endpoint" : "S3_API_ENDPOINT"
+              }
+            },
+            "icebergFilter": {
+              "type": "interval",
+              "filterColumn": "event_time",
+              "intervals": [
+                "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+              ]
+            },
+            "warehouseSource": {
+              "type": "s3",
+              "endpointConfig": {
+                "url": "teststore.aws.com",
+                "signingRegion": "us-west-2a"
+              },
+              "clientConfig": {
+                "protocol": "http",
+                "disableChunkedEncoding": true,
+                "enablePathStyleAccess": true,
+                "forceGlobalBucketAccessEnabled": false
+              },
+              "properties": {
+                "accessKeyId": {
+                  "type": "default",
+                  "password": "foo"
+                },
+                "secretAccessKey": {
+                  "type": "default",
+                  "password": "bar"
+                }
+              }
+            }
+          },
+          "inputFormat": {
+            "type": "parquet"
+          }
+        },
+...
+},
+```
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set the value to `iceberg`.|yes|
+|tableName|The iceberg table name configured in the catalog.|yes|
+|namespace|The iceberg namespace associated with the table|yes|
+|icebergFilter|JSON Object used to filter data files within a snapshot when reading|no|
+|icebergCatalog|JSON Object used to define the catalog that manages the configured iceberg table|yes|
+|warehouseSource|JSON Object used to indicate which native input source needs to be used to read the data files from the warehouse|yes|
+
+Catalog Object:
+
+There are two supported catalog types: `local` and `hive`
+
+Local catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `local`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+Hive Catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `hive`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogUri|The URI associated with the hive catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+IcebergFilter Object:
+
+This input source provides filters: `and` , `equals` , `interval` and `or`. These filters can be used to filter out data files from a snapshot, thereby reducing the number of files Druid has to ingest.
+
+`equals` Filter:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `equals`.|yes|
+|filterColumn|The column name from the iceberg table schema based on which filtering needs to happen|yes|
+|filterValue|The value to filter on|yes|
+
+`interval` Filter:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `interval`.|yes|
+|filterColumn|The column name from the iceberg table schema based on which filtering needs to happen|yes|

Review Comment:
   ```suggestion
   |filterColumn|The name of the column from the Iceberg table schema to use for filtering.|yes|
   ```





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260535057


##########
docs/ingestion/input-sources.md:
##########
@@ -794,6 +794,194 @@ The following is an example of a Combining input source spec:
 ...
 ```
 
+## Iceberg input source
+
+> You need to include the `druid-iceberg-extensions` as an extension to use the Iceberg input source.
+
+The Iceberg input source is used to read data stored in the Iceberg table format. For a given table, this input source scans up to the latest iceberg snapshot from the configured Hive catalog and the underlying live data files will be ingested using the existing input source formats available in Druid.
+
+The Iceberg input source cannot be independent as it relies on the existing input sources to perform the actual read from the Data files.
+For example, if the warehouse associated with an iceberg catalog is on `S3`, please ensure that the [`druid-s3-extensions`](../development/extensions-core/s3.md) extension is also loaded.
+
+Sample specs:
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "iceberg",
+        "tableName": "iceberg_table",
+        "namespace": "iceberg_namespace",
+        "icebergCatalog": {
+            "type": "hive",
+            "warehousePath": "hdfs://warehouse/path",
+            "catalogUri": "thrift://hive-metastore.x.com:8970",
+            "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "catalog_test",
+                "hadoop.security.authentication": "kerberos",
+                "hadoop.security.authorization": "true"
+            }
+        },
+        "icebergFilter": {
+            "type": "interval",
+            "filterColumn": "event_time",
+            "intervals": [
+              "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+            ]
+        },
+        "warehouseSource": {
+            "type": "hdfs"
+        }
+      },
+      "inputFormat": {
+        "type": "parquet"
+      }
+  },
+      ...
+},
+...
+```
+
+```json
+...
+        "ioConfig": {
+          "type": "index_parallel",
+          "inputSource": {
+            "type": "iceberg",
+            "tableName": "iceberg_table",
+            "namespace": "iceberg_namespace",
+            "icebergCatalog": {
+              "type": "hive",
+              "warehousePath": "hdfs://warehouse/path",
+              "catalogUri": "thrift://hive-metastore.x.com:8970",
+              "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "default_catalog",
+                "fs.s3a.access.key" : "S3_ACCESS_KEY",
+                "fs.s3a.secret.key" : "S3_SECRET_KEY",
+                "fs.s3a.endpoint" : "S3_API_ENDPOINT"
+              }
+            },
+            "icebergFilter": {
+              "type": "interval",
+              "filterColumn": "event_time",
+              "intervals": [
+                "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+              ]
+            },
+            "warehouseSource": {
+              "type": "s3",
+              "endpointConfig": {
+                "url": "teststore.aws.com",
+                "signingRegion": "us-west-2a"
+              },
+              "clientConfig": {
+                "protocol": "http",
+                "disableChunkedEncoding": true,
+                "enablePathStyleAccess": true,
+                "forceGlobalBucketAccessEnabled": false
+              },
+              "properties": {
+                "accessKeyId": {
+                  "type": "default",
+                  "password": "foo"
+                },
+                "secretAccessKey": {
+                  "type": "default",
+                  "password": "bar"
+                }
+              }
+            }
+          },
+          "inputFormat": {
+            "type": "parquet"
+          }
+        },
+...
+},
+```
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set the value to `iceberg`.|yes|
+|tableName|The iceberg table name configured in the catalog.|yes|
+|namespace|The iceberg namespace associated with the table|yes|
+|icebergFilter|JSON Object used to filter data files within a snapshot when reading|no|
+|icebergCatalog|JSON Object used to define the catalog that manages the configured iceberg table|yes|
+|warehouseSource|JSON Object used to indicate which native input source needs to be used to read the data files from the warehouse|yes|
+
+Catalog Object:
+
+There are two supported catalog types: `local` and `hive`
+
+Local catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `local`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+Hive Catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `hive`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogUri|The URI associated with the hive catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+IcebergFilter Object:
+
+This input source provides filters: `and` , `equals` , `interval` and `or`. These filters can be used to filter out data files from a snapshot, thereby reducing the number of files Druid has to ingest.
+
+`equals` Filter:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `equals`.|yes|
+|filterColumn|The column name from the iceberg table schema based on which filtering needs to happen|yes|
+|filterValue|The value to filter on|yes|
+
+`interval` Filter:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `interval`.|yes|
+|filterColumn|The column name from the iceberg table schema based on which filtering needs to happen|yes|

Review Comment:
   ```suggestion
   |`filterColumn`|The name of the column from the Iceberg table schema to use for filtering.|Yes|
   ```





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260473646


##########
docs/development/extensions-contrib/iceberg.md:
##########
@@ -0,0 +1,121 @@
+---
+id: iceberg 
+title: "Iceberg"
+---
+
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+
+## Iceberg Ingest Extension
+
+This extension provides [IcebergInputSource](../../ingestion/input-sources.md#iceberg-input-source) which enables ingestion of data stored in the Iceberg table format into Druid.
+
+Apache Iceberg is an open table format for huge analytic datasets. Even though iceberg manages most of its metadata in metadata files in the object storage, it is still dependent on a metastore for managing a certain amount of metadata.
+These metastores are defined as Iceberg catalogs and this extension supports connecting to the following catalog types:
+* Hive metastore catalog
+* Local catalog
+
+Support for AWS Glue and REST based catalogs are not available yet.
+
+For a given catalog, iceberg table name and filters, the IcebergInputSource works by reading the table from the catalog, applying the filters and extracting all the underlying live data files up to the latest snapshot.
+The data files are in either Parquet, ORC or Avro formats and all of these have InputFormat support in Druid. The data files typically reside in a warehouse location which could be in HDFS, S3 or the local filesystem.
+This extension relies on the existing InputSource connectors in Druid to read the data files from the warehouse. Therefore, the IcebergInputSource can be considered as an intermediate InputSource which provides the file paths for other InputSource implementations.
+
+### Load the Iceberg Ingest extension
+
+To use the iceberg extension, add the `druid-iceberg-extensions` to the list of loaded extensions. See [Loading extensions](../../configuration/extensions.md#loading-extensions) for more information.
+
+
+### Hive Metastore catalog
+
+For Druid to seamlessly talk to the Hive Metastore, ensure that the Hive specific configuration files such as `hive-site.xml` and `core-site.xml` are available in the Druid classpath for peon processes.  
+Hive specific properties can also be specified under the `catalogProperties` object in the ingestion spec. 
+
+Hive metastore catalogs can be associated with different types of warehouses, but this extension presently only supports HDFS and S3 warehouse directories.
+
+#### Reading from HDFS warehouse 
+
+Ensure that the extension `druid-hdfs-storage` is loaded. The data file paths are extracted from the Hive metastore catalog and [HDFS input source](../../ingestion/input-sources.md#hdfs-input-source) is used to ingest these files.
+The `warehouseSource` type in the ingestion spec should be `hdfs`.
+
+If the Hive metastore supports Kerberos authentication, the following properties will be required in the `catalogProperties`:
+
+```json
+"catalogProperties": {
+  "principal": "krb_principal",
+  "keytab": "/path/to/keytab"
+}
+```
+Only Kerberos based authentication is supported as of now.
+
+#### Reading from S3 warehouse
+
+Ensure that the extension `druid-s3-extensions` is loaded. The data file paths are extracted from the Hive metastore catalog and `S3InputSource` is used to ingest these files.
+The `warehouseSource` type in the ingestion spec should be `s3`. If the S3 endpoint for the warehouse is different from the endpoint configured as the deep storage, the following properties are required in the `warehouseSource` section to define the S3 endpoint settings:
+
+```json
+"warehouseSource": {
+  "type": "s3",
+  "endpointConfig": {
+    "url": "S3_ENDPOINT_URL",
+    "signingRegion": "us-east-1"
+  },
+  "clientConfig": {
+    "protocol": "http",
+    "disableChunkedEncoding": true,
+    "enablePathStyleAccess": true,
+    "forceGlobalBucketAccessEnabled": false
+  },
+  "properties": {
+    "accessKeyId": {
+      "type": "default",
+      "password": "<ACCESS_KEY_ID"
+    },
+    "secretAccessKey": {
+      "type": "default",
+      "password": "<SECRET_ACCESS_KEY>"
+    }
+  }
+}
+```
+
+This extension uses the [Hadoop AWS module](https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/) to connect to S3 and retrieve the metadata and data file paths.
+The following properties will be required in the `catalogProperties`:
+
+```json
+"catalogProperties": {
+  "fs.s3a.access.key" : "S3_ACCESS_KEY",
+  "fs.s3a.secret.key" : "S3_SECRET_KEY",
+  "fs.s3a.endpoint" : "S3_API_ENDPOINT"
+}
+```
+Since the Hadoop AWS connector uses the `s3a` filesystem based client, the warehouse path should be specified with the `s3a://` protocol instead of `s3://`.
+
+### Local Catalog
+
+The local catalog type can be used for catalogs configured on the local filesystem. The `icebergCatalog` type should be set as `local`. This catalog is useful for demos or localized tests and is not recommended for production use cases.
+This catalog only supports reading from a local filesystem and so the `warehouseSource` is defined as `local`.
+
+### Known limitations

Review Comment:
   ```suggestion
   ## Known limitations
   ```
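   To make the `s3a://` requirement above concrete, a Hive catalog pointed at an S3 warehouse might look like the sketch below; the bucket and path are placeholders, and the `catalogProperties` keys are the Hadoop AWS settings already shown in the doc:
   
   ```json
   "icebergCatalog": {
     "type": "hive",
     "warehousePath": "s3a://example-bucket/warehouse/path",
     "catalogUri": "thrift://hive-metastore.x.com:8970",
     "catalogProperties": {
       "fs.s3a.access.key": "S3_ACCESS_KEY",
       "fs.s3a.secret.key": "S3_SECRET_KEY",
       "fs.s3a.endpoint": "S3_API_ENDPOINT"
     }
   }
   ```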


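   For the local catalog section, a minimal sketch of the corresponding `inputSource` portion, assuming the `local` warehouse source needs no extra properties because the data file paths come from the catalog itself; the warehouse path and table names are placeholders:
   
   ```json
   "inputSource": {
     "type": "iceberg",
     "tableName": "iceberg_table",
     "namespace": "iceberg_namespace",
     "icebergCatalog": {
       "type": "local",
       "warehousePath": "/tmp/iceberg_warehouse"
     },
     "warehouseSource": {
       "type": "local"
     }
   },
   "inputFormat": {
     "type": "parquet"
   }
   ```
   
   As the doc notes, this setup is intended for demos or localized tests rather than production use.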



Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260526488


##########
docs/ingestion/input-sources.md:
##########
@@ -794,6 +794,194 @@ The following is an example of a Combining input source spec:
 ...
 ```
 
+## Iceberg input source
+
+> You need to include the `druid-iceberg-extensions` as an extension to use the Iceberg input source.
+
+The Iceberg input source is used to read data stored in the Iceberg table format. For a given table, this input source scans up to the latest iceberg snapshot from the configured Hive catalog and the underlying live data files will be ingested using the existing input source formats available in Druid.
+
+The Iceberg input source cannot be independent as it relies on the existing input sources to perform the actual read from the Data files.
+For example, if the warehouse associated with an iceberg catalog is on `S3`, please ensure that the [`druid-s3-extensions`](../development/extensions-core/s3.md) extension is also loaded.
+
+Sample specs:
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "iceberg",
+        "tableName": "iceberg_table",
+        "namespace": "iceberg_namespace",
+        "icebergCatalog": {
+            "type": "hive",
+            "warehousePath": "hdfs://warehouse/path",
+            "catalogUri": "thrift://hive-metastore.x.com:8970",
+            "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "catalog_test",
+                "hadoop.security.authentication": "kerberos",
+                "hadoop.security.authorization": "true"
+            }
+        },
+        "icebergFilter": {
+            "type": "interval",
+            "filterColumn": "event_time",
+            "intervals": [
+              "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+            ]
+        },
+        "warehouseSource": {
+            "type": "hdfs"
+        }
+      },
+      "inputFormat": {
+        "type": "parquet"
+      }
+  },
+      ...
+},
+...
+```
+
+```json
+...
+        "ioConfig": {
+          "type": "index_parallel",
+          "inputSource": {
+            "type": "iceberg",
+            "tableName": "iceberg_table",
+            "namespace": "iceberg_namespace",
+            "icebergCatalog": {
+              "type": "hive",
+              "warehousePath": "hdfs://warehouse/path",
+              "catalogUri": "thrift://hive-metastore.x.com:8970",
+              "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "default_catalog",
+                "fs.s3a.access.key" : "S3_ACCESS_KEY",
+                "fs.s3a.secret.key" : "S3_SECRET_KEY",
+                "fs.s3a.endpoint" : "S3_API_ENDPOINT"
+              }
+            },
+            "icebergFilter": {
+              "type": "interval",
+              "filterColumn": "event_time",
+              "intervals": [
+                "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+              ]
+            },
+            "warehouseSource": {
+              "type": "s3",
+              "endpointConfig": {
+                "url": "teststore.aws.com",
+                "signingRegion": "us-west-2a"
+              },
+              "clientConfig": {
+                "protocol": "http",
+                "disableChunkedEncoding": true,
+                "enablePathStyleAccess": true,
+                "forceGlobalBucketAccessEnabled": false
+              },
+              "properties": {
+                "accessKeyId": {
+                  "type": "default",
+                  "password": "foo"
+                },
+                "secretAccessKey": {
+                  "type": "default",
+                  "password": "bar"
+                }
+              },
+            }
+          },
+          "inputFormat": {
+            "type": "parquet"
+          }
+        },
+...
+},
+```
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set the value to `iceberg`.|yes|
+|tableName|The iceberg table name configured in the catalog.|yes|
+|namespace|The iceberg namespace associated with the table|yes|
+|icebergFilter|JSON Object used to filter data files within a snapshot when reading|no|
+|icebergCatalog|JSON Object used to define the catalog that manages the configured iceberg table|yes|
+|warehouseSource|JSON Object used to indicate which native input source needs to be used to read the data files from the warehouse|yes|
+
+Catalog Object:
+
+There are two supported catalog types: `local` and `hive`
+
+Local catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `local`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+Hive Catalog:
+
+|Property|Description|Required|

Review Comment:
   ```suggestion
   |**Property**|**Description**|**Required**|
   ```





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260525968


##########
docs/ingestion/input-sources.md:
##########
@@ -794,6 +794,194 @@ The following is an example of a Combining input source spec:
 ...
 ```
 
+## Iceberg input source
+
+> You need to include the `druid-iceberg-extensions` as an extension to use the Iceberg input source.
+
+The Iceberg input source is used to read data stored in the Iceberg table format. For a given table, this input source scans up to the latest iceberg snapshot from the configured Hive catalog and the underlying live data files will be ingested using the existing input source formats available in Druid.
+
+The Iceberg input source cannot be independent as it relies on the existing input sources to perform the actual read from the Data files.
+For example, if the warehouse associated with an iceberg catalog is on `S3`, please ensure that the [`druid-s3-extensions`](../development/extensions-core/s3.md) extension is also loaded.
+
+Sample specs:
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "iceberg",
+        "tableName": "iceberg_table",
+        "namespace": "iceberg_namespace",
+        "icebergCatalog": {
+            "type": "hive",
+            "warehousePath": "hdfs://warehouse/path",
+            "catalogUri": "thrift://hive-metastore.x.com:8970",
+            "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "catalog_test",
+                "hadoop.security.authentication": "kerberos",
+                "hadoop.security.authorization": "true"
+            }
+        },
+        "icebergFilter": {
+            "type": "interval",
+            "filterColumn": "event_time",
+            "intervals": [
+              "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+            ]
+        },
+        "warehouseSource": {
+            "type": "hdfs"
+        }
+      },
+      "inputFormat": {
+        "type": "parquet"
+      }
+  },
+      ...
+},
+...
+```
+
+```json
+...
+        "ioConfig": {
+          "type": "index_parallel",
+          "inputSource": {
+            "type": "iceberg",
+            "tableName": "iceberg_table",
+            "namespace": "iceberg_namespace",
+            "icebergCatalog": {
+              "type": "hive",
+              "warehousePath": "hdfs://warehouse/path",
+              "catalogUri": "thrift://hive-metastore.x.com:8970",
+              "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "default_catalog",
+                "fs.s3a.access.key" : "S3_ACCESS_KEY",
+                "fs.s3a.secret.key" : "S3_SECRET_KEY",
+                "fs.s3a.endpoint" : "S3_API_ENDPOINT"
+              }
+            },
+            "icebergFilter": {
+              "type": "interval",
+              "filterColumn": "event_time",
+              "intervals": [
+                "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+              ]
+            },
+            "warehouseSource": {
+              "type": "s3",
+              "endpointConfig": {
+                "url": "teststore.aws.com",
+                "signingRegion": "us-west-2a"
+              },
+              "clientConfig": {
+                "protocol": "http",
+                "disableChunkedEncoding": true,
+                "enablePathStyleAccess": true,
+                "forceGlobalBucketAccessEnabled": false
+              },
+              "properties": {
+                "accessKeyId": {
+                  "type": "default",
+                  "password": "foo"
+                },
+                "secretAccessKey": {
+                  "type": "default",
+                  "password": "bar"
+                }
+              },
+            }
+          },
+          "inputFormat": {
+            "type": "parquet"
+          }
+        },
+...
+},
+```
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set the value to `iceberg`.|yes|
+|tableName|The iceberg table name configured in the catalog.|yes|
+|namespace|The iceberg namespace associated with the table|yes|
+|icebergFilter|JSON Object used to filter data files within a snapshot when reading|no|
+|icebergCatalog|JSON Object used to define the catalog that manages the configured iceberg table|yes|
+|warehouseSource|JSON Object used to indicate which native input source needs to be used to read the data files from the warehouse|yes|
+
+Catalog Object:
+
+There are two supported catalog types: `local` and `hive`
+
+Local catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `local`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|

Review Comment:
   ```suggestion
   |`warehousePath`|The location of the warehouse associated with the catalog.|Yes|
   ```





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260523285


##########
docs/ingestion/input-sources.md:
##########
@@ -794,6 +794,194 @@ The following is an example of a Combining input source spec:
 ...
 ```
 
+## Iceberg input source
+
+> You need to include the `druid-iceberg-extensions` as an extension to use the Iceberg input source.
+
+The Iceberg input source is used to read data stored in the Iceberg table format. For a given table, this input source scans up to the latest iceberg snapshot from the configured Hive catalog and the underlying live data files will be ingested using the existing input source formats available in Druid.
+
+The Iceberg input source cannot be independent as it relies on the existing input sources to perform the actual read from the Data files.
+For example, if the warehouse associated with an iceberg catalog is on `S3`, please ensure that the [`druid-s3-extensions`](../development/extensions-core/s3.md) extension is also loaded.
+
+Sample specs:
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "iceberg",
+        "tableName": "iceberg_table",
+        "namespace": "iceberg_namespace",
+        "icebergCatalog": {
+            "type": "hive",
+            "warehousePath": "hdfs://warehouse/path",
+            "catalogUri": "thrift://hive-metastore.x.com:8970",
+            "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "catalog_test",
+                "hadoop.security.authentication": "kerberos",
+                "hadoop.security.authorization": "true"
+            }
+        },
+        "icebergFilter": {
+            "type": "interval",
+            "filterColumn": "event_time",
+            "intervals": [
+              "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+            ]
+        },
+        "warehouseSource": {
+            "type": "hdfs"
+        }
+      },
+      "inputFormat": {
+        "type": "parquet"
+      }
+  },
+      ...
+},
+...
+```
+
+```json
+...
+        "ioConfig": {
+          "type": "index_parallel",
+          "inputSource": {
+            "type": "iceberg",
+            "tableName": "iceberg_table",
+            "namespace": "iceberg_namespace",
+            "icebergCatalog": {
+              "type": "hive",
+              "warehousePath": "hdfs://warehouse/path",
+              "catalogUri": "thrift://hive-metastore.x.com:8970",
+              "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "default_catalog",
+                "fs.s3a.access.key" : "S3_ACCESS_KEY",
+                "fs.s3a.secret.key" : "S3_SECRET_KEY",
+                "fs.s3a.endpoint" : "S3_API_ENDPOINT"
+              }
+            },
+            "icebergFilter": {
+              "type": "interval",
+              "filterColumn": "event_time",
+              "intervals": [
+                "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+              ]
+            },
+            "warehouseSource": {
+              "type": "s3",
+              "endpointConfig": {
+                "url": "teststore.aws.com",
+                "signingRegion": "us-west-2a"
+              },
+              "clientConfig": {
+                "protocol": "http",
+                "disableChunkedEncoding": true,
+                "enablePathStyleAccess": true,
+                "forceGlobalBucketAccessEnabled": false
+              },
+              "properties": {
+                "accessKeyId": {
+                  "type": "default",
+                  "password": "foo"
+                },
+                "secretAccessKey": {
+                  "type": "default",
+                  "password": "bar"
+                }
+              },
+            }
+          },
+          "inputFormat": {
+            "type": "parquet"
+          }
+        },
+...
+},
+```
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set the value to `iceberg`.|yes|
+|tableName|The iceberg table name configured in the catalog.|yes|
+|namespace|The iceberg namespace associated with the table|yes|
+|icebergFilter|JSON Object used to filter data files within a snapshot when reading|no|
+|icebergCatalog|JSON Object used to define the catalog that manages the configured iceberg table|yes|
+|warehouseSource|JSON Object used to indicate which native input source needs to be used to read the data files from the warehouse|yes|
+
+Catalog Object:
+
+There are two supported catalog types: `local` and `hive`
+
+Local catalog:

Review Comment:
   ```suggestion
   The following table lists the properties of a `local` catalog:
   ```





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260539096


##########
docs/ingestion/input-sources.md:
##########
@@ -794,6 +794,194 @@ The following is an example of a Combining input source spec:
 ...
 ```
 
+## Iceberg input source
+
+> You need to include the `druid-iceberg-extensions` as an extension to use the Iceberg input source.
+
+The Iceberg input source is used to read data stored in the Iceberg table format. For a given table, this input source scans up to the latest iceberg snapshot from the configured Hive catalog and the underlying live data files will be ingested using the existing input source formats available in Druid.
+
+The Iceberg input source cannot be independent as it relies on the existing input sources to perform the actual read from the Data files.
+For example, if the warehouse associated with an iceberg catalog is on `S3`, please ensure that the [`druid-s3-extensions`](../development/extensions-core/s3.md) extension is also loaded.
+
+Sample specs:
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "iceberg",
+        "tableName": "iceberg_table",
+        "namespace": "iceberg_namespace",
+        "icebergCatalog": {
+            "type": "hive",
+            "warehousePath": "hdfs://warehouse/path",
+            "catalogUri": "thrift://hive-metastore.x.com:8970",
+            "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "catalog_test",
+                "hadoop.security.authentication": "kerberos",
+                "hadoop.security.authorization": "true"
+            }
+        },
+        "icebergFilter": {
+            "type": "interval",
+            "filterColumn": "event_time",
+            "intervals": [
+              "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+            ]
+        },
+        "warehouseSource": {
+            "type": "hdfs"
+        }
+      },
+      "inputFormat": {
+        "type": "parquet"
+      }
+  },
+      ...
+},
+...
+```
+
+```json
+...
+        "ioConfig": {
+          "type": "index_parallel",
+          "inputSource": {
+            "type": "iceberg",
+            "tableName": "iceberg_table",
+            "namespace": "iceberg_namespace",
+            "icebergCatalog": {
+              "type": "hive",
+              "warehousePath": "hdfs://warehouse/path",
+              "catalogUri": "thrift://hive-metastore.x.com:8970",
+              "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "default_catalog",
+                "fs.s3a.access.key" : "S3_ACCESS_KEY",
+                "fs.s3a.secret.key" : "S3_SECRET_KEY",
+                "fs.s3a.endpoint" : "S3_API_ENDPOINT"
+              }
+            },
+            "icebergFilter": {
+              "type": "interval",
+              "filterColumn": "event_time",
+              "intervals": [
+                "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+              ]
+            },
+            "warehouseSource": {
+              "type": "s3",
+              "endpointConfig": {
+                "url": "teststore.aws.com",
+                "signingRegion": "us-west-2a"
+              },
+              "clientConfig": {
+                "protocol": "http",
+                "disableChunkedEncoding": true,
+                "enablePathStyleAccess": true,
+                "forceGlobalBucketAccessEnabled": false
+              },
+              "properties": {
+                "accessKeyId": {
+                  "type": "default",
+                  "password": "foo"
+                },
+                "secretAccessKey": {
+                  "type": "default",
+                  "password": "bar"
+                }
+              },
+            }
+          },
+          "inputFormat": {
+            "type": "parquet"
+          }
+        },
+...
+},
+```
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set the value to `iceberg`.|yes|
+|tableName|The iceberg table name configured in the catalog.|yes|
+|namespace|The iceberg namespace associated with the table|yes|
+|icebergFilter|JSON Object used to filter data files within a snapshot when reading|no|
+|icebergCatalog|JSON Object used to define the catalog that manages the configured iceberg table|yes|
+|warehouseSource|JSON Object used to indicate which native input source needs to be used to read the data files from the warehouse|yes|
+
+Catalog Object:
+
+There are two supported catalog types: `local` and `hive`
+
+Local catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `local`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+Hive Catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `hive`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogUri|The URI associated with the hive catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+IcebergFilter Object:
+
+This input source provides filters: `and` , `equals` , `interval` and `or`. These filters can be used to filter out data files from a snapshot, thereby reducing the number of files Druid has to ingest.
+
+`equals` Filter:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `equals`.|yes|
+|filterColumn|The column name from the iceberg table schema based on which filtering needs to happen|yes|
+|filterValue|The value to filter on|yes|
+
+`interval` Filter:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `interval`.|yes|
+|filterColumn|The column name from the iceberg table schema based on which filtering needs to happen|yes|
+|intervals|A JSON array containing ISO-8601 interval strings. This defines the time ranges to filter on. The start interval is inclusive and the end interval is exclusive. |yes|
+
+`and` Filter:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `and`.|yes|
+|filters|List of iceberg filters that needs to be AND-ed|yes|
+
+`or` Filter:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `or`.|yes|

Review Comment:
   ```suggestion
   |`type`|Set this value to `or`.|Yes|
   ```
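
   The filter objects documented above can be nested. As an illustration only (the column names and values below are hypothetical), an `and` filter that combines an `interval` filter with an `equals` filter might look like this:

   ```json
   "icebergFilter": {
     "type": "and",
     "filters": [
       {
         "type": "interval",
         "filterColumn": "event_time",
         "intervals": ["2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"]
       },
       {
         "type": "equals",
         "filterColumn": "region",
         "filterValue": "us-west-2"
       }
     ]
   }
   ```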





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260535437


##########
docs/ingestion/input-sources.md:
##########
@@ -794,6 +794,194 @@ The following is an example of a Combining input source spec:
 ...
 ```
 
+## Iceberg input source
+
+> You need to include the `druid-iceberg-extensions` as an extension to use the Iceberg input source.
+
+The Iceberg input source is used to read data stored in the Iceberg table format. For a given table, this input source scans up to the latest iceberg snapshot from the configured Hive catalog and the underlying live data files will be ingested using the existing input source formats available in Druid.
+
+The Iceberg input source cannot be independent as it relies on the existing input sources to perform the actual read from the Data files.
+For example, if the warehouse associated with an iceberg catalog is on `S3`, please ensure that the [`druid-s3-extensions`](../development/extensions-core/s3.md) extension is also loaded.
+
+Sample specs:
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "iceberg",
+        "tableName": "iceberg_table",
+        "namespace": "iceberg_namespace",
+        "icebergCatalog": {
+            "type": "hive",
+            "warehousePath": "hdfs://warehouse/path",
+            "catalogUri": "thrift://hive-metastore.x.com:8970",
+            "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "catalog_test",
+                "hadoop.security.authentication": "kerberos",
+                "hadoop.security.authorization": "true"
+            }
+        },
+        "icebergFilter": {
+            "type": "interval",
+            "filterColumn": "event_time",
+            "intervals": [
+              "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+            ]
+        },
+        "warehouseSource": {
+            "type": "hdfs"
+        }
+      },
+      "inputFormat": {
+        "type": "parquet"
+      }
+  },
+      ...
+},
+...
+```
+
+```json
+...
+        "ioConfig": {
+          "type": "index_parallel",
+          "inputSource": {
+            "type": "iceberg",
+            "tableName": "iceberg_table",
+            "namespace": "iceberg_namespace",
+            "icebergCatalog": {
+              "type": "hive",
+              "warehousePath": "hdfs://warehouse/path",
+              "catalogUri": "thrift://hive-metastore.x.com:8970",
+              "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "default_catalog",
+                "fs.s3a.access.key" : "S3_ACCESS_KEY",
+                "fs.s3a.secret.key" : "S3_SECRET_KEY",
+                "fs.s3a.endpoint" : "S3_API_ENDPOINT"
+              }
+            },
+            "icebergFilter": {
+              "type": "interval",
+              "filterColumn": "event_time",
+              "intervals": [
+                "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+              ]
+            },
+            "warehouseSource": {
+              "type": "s3",
+              "endpointConfig": {
+                "url": "teststore.aws.com",
+                "signingRegion": "us-west-2a"
+              },
+              "clientConfig": {
+                "protocol": "http",
+                "disableChunkedEncoding": true,
+                "enablePathStyleAccess": true,
+                "forceGlobalBucketAccessEnabled": false
+              },
+              "properties": {
+                "accessKeyId": {
+                  "type": "default",
+                  "password": "foo"
+                },
+                "secretAccessKey": {
+                  "type": "default",
+                  "password": "bar"
+                }
+              },
+            }
+          },
+          "inputFormat": {
+            "type": "parquet"
+          }
+        },
+...
+},
+```
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set the value to `iceberg`.|yes|
+|tableName|The iceberg table name configured in the catalog.|yes|
+|namespace|The iceberg namespace associated with the table|yes|
+|icebergFilter|JSON Object used to filter data files within a snapshot when reading|no|
+|icebergCatalog|JSON Object used to define the catalog that manages the configured iceberg table|yes|
+|warehouseSource|JSON Object used to indicate which native input source needs to be used to read the data files from the warehouse|yes|
+
+Catalog Object:
+
+There are two supported catalog types: `local` and `hive`
+
+Local catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `local`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+Hive Catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `hive`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogUri|The URI associated with the hive catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+IcebergFilter Object:
+
+This input source provides filters: `and` , `equals` , `interval` and `or`. These filters can be used to filter out data files from a snapshot, thereby reducing the number of files Druid has to ingest.
+
+`equals` Filter:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `equals`.|yes|
+|filterColumn|The column name from the iceberg table schema based on which filtering needs to happen|yes|
+|filterValue|The value to filter on|yes|
+
+`interval` Filter:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `interval`.|yes|
+|filterColumn|The column name from the iceberg table schema based on which filtering needs to happen|yes|
+|intervals|A JSON array containing ISO-8601 interval strings. This defines the time ranges to filter on. The start interval is inclusive and the end interval is exclusive. |yes|
+
+`and` Filter:

Review Comment:
   ```suggestion
   The `and` filter:
   ```





[GitHub] [druid] a2l007 commented on a diff in pull request #14329: Extension to read and ingest iceberg data files

Posted by "a2l007 (via GitHub)" <gi...@apache.org>.
a2l007 commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1225896625


##########
docs/development/extensions-contrib/iceberg.md:
##########
@@ -0,0 +1,120 @@
+---
+id: iceberg 
+title: "Iceberg"
+---
+
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+
+## Iceberg Ingest Extension
+
+This extension provides [IcebergInputSource](../../ingestion/input-sources.md#iceberg-input-source) which enables ingestion of data stored in the Iceberg table format into Druid.
+
+Apache Iceberg is an open table format for huge analytic datasets. Even though iceberg manages most of its metadata on metadata files, it is still dependent on a metastore for managing a certain amount of metadata.

Review Comment:
   No, I'm referring to a metastore, also known as an Iceberg metadata catalog or simply an Iceberg catalog. I've slightly reworded this; let me know if it helps.





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260324391


##########
docs/development/extensions-contrib/iceberg.md:
##########
@@ -0,0 +1,121 @@
+---
+id: iceberg 
+title: "Iceberg"
+---
+
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+
+## Iceberg Ingest Extension
+
+This extension provides [IcebergInputSource](../../ingestion/input-sources.md#iceberg-input-source) which enables ingestion of data stored in the Iceberg table format into Druid.
+
+Apache Iceberg is an open table format for huge analytic datasets. Even though iceberg manages most of its metadata in metadata files in the object storage, it is still dependent on a metastore for managing a certain amount of metadata.

Review Comment:
   ```suggestion
   Apache Iceberg is an open table format for huge analytic datasets. Although Iceberg manages most of its metadata in metadata files in the object storage, it depends on a metastore for managing a certain amount of metadata.
   ```





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260330433


##########
docs/development/extensions-contrib/iceberg.md:
##########
@@ -0,0 +1,121 @@
+---
+id: iceberg 
+title: "Iceberg"
+---
+
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+
+## Iceberg Ingest Extension
+
+This extension provides [IcebergInputSource](../../ingestion/input-sources.md#iceberg-input-source) which enables ingestion of data stored in the Iceberg table format into Druid.

Review Comment:
   ```suggestion
   The Iceberg extension lets you ingest data stored in the Iceberg table format into Apache Druid.
   ```
   Consider removing [IcebergInputSource](../../ingestion/input-sources.md#iceberg-input-source) from this introductory sentence. Instead, link to the Iceberg table format.





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "a2l007 (via GitHub)" <gi...@apache.org>.
a2l007 commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1255052850


##########
extensions-contrib/druid-iceberg-extensions/pom.xml:
##########
@@ -0,0 +1,314 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+<project xmlns="http://maven.apache.org/POM/4.0.0"
+         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
+         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
+
+  <groupId>org.apache.druid.extensions</groupId>
+  <artifactId>druid-iceberg-extensions</artifactId>
+  <name>druid-iceberg-extensions</name>
+  <description>druid-iceberg-extensions</description>
+
+
+  <parent>
+    <artifactId>druid</artifactId>
+    <groupId>org.apache.druid</groupId>
+    <version>27.0.0-SNAPSHOT</version>
+    <relativePath>../../pom.xml</relativePath>
+  </parent>
+  <modelVersion>4.0.0</modelVersion>
+
+  <properties>
+  </properties>
+  <dependencies>
+    <dependency>

Review Comment:
   we can probably remove it from the distribution pom.xml under the hadoop2 profile, if needed.





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "abhishekagarwal87 (via GitHub)" <gi...@apache.org>.
abhishekagarwal87 commented on PR #14329:
URL: https://github.com/apache/druid/pull/14329#issuecomment-1636806365

   Though I think there is still a lot of scope for reducing the number of dependencies that this extension has. It has jars for curator, jetty, jersey, protobuf, orc, and mysql. There is also an iceberg spark runtime jar, and I can't figure out how it will be used. This will become an issue for the release manager, as all these extra dependencies are going to have CVEs that require investigation before being suppressed.




Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "abhishekagarwal87 (via GitHub)" <gi...@apache.org>.
abhishekagarwal87 commented on PR #14329:
URL: https://github.com/apache/druid/pull/14329#issuecomment-1630384296

   @a2l007 - I built this locally and the size of the extension directory is 431 MB. Half of that is coming from `aws-java-sdk-bundle-1.12.367.jar`. This jar includes all AWS services. I think it will be better to replace it with an alternative that has just the stuff we require. 




Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "a2l007 (via GitHub)" <gi...@apache.org>.
a2l007 commented on PR #14329:
URL: https://github.com/apache/druid/pull/14329#issuecomment-1635035009

   > @a2l007 - I built this locally and the size of the extension directory is 431 MB. Half of that is coming from `aws-java-sdk-bundle-1.12.367.jar`. This jar includes all AWS services. I think it will be better to replace it with an alternative that has just the stuff we require.
   
   @abhishekagarwal87 Good catch! I've excluded the aws-java-sdk-bundle and changed the scope for a few of the other dependencies.




Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "a2l007 (via GitHub)" <gi...@apache.org>.
a2l007 commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1264191393


##########
extensions-contrib/druid-iceberg-extensions/pom.xml:
##########
@@ -298,23 +298,33 @@
       <artifactId>hadoop-aws</artifactId>
       <version>${hadoop.compile.version}</version>
       <scope>runtime</scope>
+      <exclusions>
+        <exclusion>
+          <groupId>com.amazonaws</groupId>
+          <artifactId>aws-java-sdk-bundle</artifactId>

Review Comment:
   The metastore only needs the hadoop-aws jar, which provides the `org.apache.hadoop.fs.s3a.S3AFileSystem` class to resolve the `s3a` client. The S3 Druid extension (which has the aws-java-sdk-s3 dependency) takes care of operations on the objects retrieved by the metastore.
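
   For reference, this is the part of the earlier sample spec that the `hadoop-aws` jar serves: the `fs.s3a.*` settings passed through `catalogProperties` configure the S3A client it provides, so the metastore can resolve `s3a://` data file paths. The values below are the same placeholders used in that sample:

   ```json
   "catalogProperties": {
     "fs.s3a.access.key": "S3_ACCESS_KEY",
     "fs.s3a.secret.key": "S3_SECRET_KEY",
     "fs.s3a.endpoint": "S3_API_ENDPOINT"
   }
   ```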





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "abhishekagarwal87 (via GitHub)" <gi...@apache.org>.
abhishekagarwal87 commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1263570355


##########
extensions-contrib/druid-iceberg-extensions/pom.xml:
##########
@@ -298,23 +298,33 @@
       <artifactId>hadoop-aws</artifactId>
       <version>${hadoop.compile.version}</version>
       <scope>runtime</scope>
+      <exclusions>
+        <exclusion>
+          <groupId>com.amazonaws</groupId>
+          <artifactId>aws-java-sdk-bundle</artifactId>

Review Comment:
   Don't you need any AWS dependency? For example, in hdfs-storage, where we excluded this, we also added the following:
   ```
           <dependency>
             <groupId>com.amazonaws</groupId>
             <artifactId>aws-java-sdk-s3</artifactId>
             <version>${aws.sdk.version}</version>
             <scope>runtime</scope>
           </dependency>
   ```





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260358588


##########
docs/development/extensions-contrib/iceberg.md:
##########
@@ -0,0 +1,121 @@
+---
+id: iceberg 
+title: "Iceberg"
+---
+
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+
+## Iceberg Ingest Extension
+
+This extension provides [IcebergInputSource](../../ingestion/input-sources.md#iceberg-input-source) which enables ingestion of data stored in the Iceberg table format into Druid.
+
+Apache Iceberg is an open table format for huge analytic datasets. Even though iceberg manages most of its metadata in metadata files in the object storage, it is still dependent on a metastore for managing a certain amount of metadata.
+These metastores are defined as Iceberg catalogs and this extension supports connecting to the following catalog types:
+* Hive metastore catalog
+* Local catalog
+
+Support for AWS Glue and REST based catalogs are not available yet.
+
+For a given catalog, iceberg table name and filters, the IcebergInputSource works by reading the table from the catalog, applying the filters and extracting all the underlying live data files up to the latest snapshot.
+The data files are in either Parquet, ORC or Avro formats and all of these have InputFormat support in Druid. The data files typically reside in a warehouse location which could be in HDFS, S3 or the local filesystem.
+This extension relies on the existing InputSource connectors in Druid to read the data files from the warehouse. Therefore, the IcebergInputSource can be considered as an intermediate InputSource which provides the file paths for other InputSource implementations.
+
+### Load the Iceberg Ingest extension
+
+To use the iceberg extension, add the `druid-iceberg-extensions` to the list of loaded extensions. See [Loading extensions](../../configuration/extensions.md#loading-extensions) for more information.
+
+
+### Hive Metastore catalog
+
+For Druid to seamlessly talk to the Hive Metastore, ensure that the Hive specific configuration files such as `hive-site.xml` and `core-site.xml` are available in the Druid classpath for peon processes.  
+Hive specific properties can also be specified under the `catalogProperties` object in the ingestion spec. 
+
+Hive metastore catalogs can be associated with different types of warehouses, but this extension presently only supports HDFS and S3 warehouse directories.
+
+#### Reading from HDFS warehouse 
+
+Ensure that the extension `druid-hdfs-storage` is loaded. The data file paths are extracted from the Hive metastore catalog and [HDFS input source](../../ingestion/input-sources.md#hdfs-input-source) is used to ingest these files.
+The `warehouseSource` type in the ingestion spec should be `hdfs`.
+
+If the Hive metastore supports Kerberos authentication, the following properties will be required in the `catalogProperties`:

Review Comment:
   ```suggestion
   If the Hive metastore supports Kerberos authentication, include the following properties in the `catalogProperties`:
   ```
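
   For illustration, the Kerberos-related keys already shown in the sample spec earlier in this thread are the kind of properties that sentence refers to (the principal is a placeholder):

   ```json
   "catalogProperties": {
     "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
     "hive.metastore.sasl.enabled": "true",
     "hadoop.security.authentication": "kerberos",
     "hadoop.security.authorization": "true"
   }
   ```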





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260356977


##########
docs/development/extensions-contrib/iceberg.md:
##########
@@ -0,0 +1,121 @@
+---
+id: iceberg 
+title: "Iceberg"
+---
+
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+
+## Iceberg Ingest Extension
+
+This extension provides [IcebergInputSource](../../ingestion/input-sources.md#iceberg-input-source) which enables ingestion of data stored in the Iceberg table format into Druid.
+
+Apache Iceberg is an open table format for huge analytic datasets. Even though iceberg manages most of its metadata in metadata files in the object storage, it is still dependent on a metastore for managing a certain amount of metadata.
+These metastores are defined as Iceberg catalogs and this extension supports connecting to the following catalog types:
+* Hive metastore catalog
+* Local catalog
+
+Support for AWS Glue and REST based catalogs are not available yet.
+
+For a given catalog, iceberg table name and filters, the IcebergInputSource works by reading the table from the catalog, applying the filters and extracting all the underlying live data files up to the latest snapshot.
+The data files are in either Parquet, ORC or Avro formats and all of these have InputFormat support in Druid. The data files typically reside in a warehouse location which could be in HDFS, S3 or the local filesystem.
+This extension relies on the existing InputSource connectors in Druid to read the data files from the warehouse. Therefore, the IcebergInputSource can be considered as an intermediate InputSource which provides the file paths for other InputSource implementations.
+
+### Load the Iceberg Ingest extension
+
+To use the iceberg extension, add the `druid-iceberg-extensions` to the list of loaded extensions. See [Loading extensions](../../configuration/extensions.md#loading-extensions) for more information.
+
+
+### Hive Metastore catalog
+
+For Druid to seamlessly talk to the Hive Metastore, ensure that the Hive specific configuration files such as `hive-site.xml` and `core-site.xml` are available in the Druid classpath for peon processes.  
+Hive specific properties can also be specified under the `catalogProperties` object in the ingestion spec. 
+
+Hive metastore catalogs can be associated with different types of warehouses, but this extension presently only supports HDFS and S3 warehouse directories.

Review Comment:
   ```suggestion
   The `druid-iceberg-extensions` extension presently only supports HDFS and S3 warehouse directories.
   ```





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260356267


##########
docs/development/extensions-contrib/iceberg.md:
##########
@@ -0,0 +1,121 @@
+---
+id: iceberg 
+title: "Iceberg"
+---
+
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+
+## Iceberg Ingest Extension
+
+This extension provides [IcebergInputSource](../../ingestion/input-sources.md#iceberg-input-source) which enables ingestion of data stored in the Iceberg table format into Druid.
+
+Apache Iceberg is an open table format for huge analytic datasets. Even though iceberg manages most of its metadata in metadata files in the object storage, it is still dependent on a metastore for managing a certain amount of metadata.
+These metastores are defined as Iceberg catalogs and this extension supports connecting to the following catalog types:
+* Hive metastore catalog
+* Local catalog
+
+Support for AWS Glue and REST based catalogs are not available yet.
+
+For a given catalog, iceberg table name and filters, the IcebergInputSource works by reading the table from the catalog, applying the filters and extracting all the underlying live data files up to the latest snapshot.
+The data files are in either Parquet, ORC or Avro formats and all of these have InputFormat support in Druid. The data files typically reside in a warehouse location which could be in HDFS, S3 or the local filesystem.
+This extension relies on the existing InputSource connectors in Druid to read the data files from the warehouse. Therefore, the IcebergInputSource can be considered as an intermediate InputSource which provides the file paths for other InputSource implementations.
+
+### Load the Iceberg Ingest extension
+
+To use the iceberg extension, add the `druid-iceberg-extensions` to the list of loaded extensions. See [Loading extensions](../../configuration/extensions.md#loading-extensions) for more information.
+
+
+### Hive Metastore catalog
+
+For Druid to seamlessly talk to the Hive Metastore, ensure that the Hive specific configuration files such as `hive-site.xml` and `core-site.xml` are available in the Druid classpath for peon processes.  
+Hive specific properties can also be specified under the `catalogProperties` object in the ingestion spec. 

Review Comment:
   ```suggestion
   You can also specify Hive properties under the `catalogProperties` object in the ingestion spec. 
   ```
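
As a hedged illustration of that pattern, the Hive properties below are taken from the sample ingestion spec in `docs/ingestion/input-sources.md` in this PR and are passed through to the catalog as-is:

```json
"catalogProperties": {
  "hive.metastore.connect.retries": "1",
  "hive.metastore.sasl.enabled": "true",
  "hive.metastore.kerberos.principal": "KRB_PRINCIPAL"
}
```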





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260510492


##########
docs/ingestion/input-sources.md:
##########
@@ -794,6 +794,194 @@ The following is an example of a Combining input source spec:
 ...
 ```
 
+## Iceberg input source
+
+> You need to include the `druid-iceberg-extensions` as an extension to use the Iceberg input source.
+
+The Iceberg input source is used to read data stored in the Iceberg table format. For a given table, this input source scans up to the latest iceberg snapshot from the configured Hive catalog and the underlying live data files will be ingested using the existing input source formats available in Druid.
+
+The Iceberg input source cannot be independent as it relies on the existing input sources to perform the actual read from the Data files.
+For example, if the warehouse associated with an iceberg catalog is on `S3`, please ensure that the [`druid-s3-extensions`](../development/extensions-core/s3.md) extension is also loaded.
+
+Sample specs:
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "iceberg",
+        "tableName": "iceberg_table",
+        "namespace": "iceberg_namespace",
+        "icebergCatalog": {
+            "type": "hive",
+            "warehousePath": "hdfs://warehouse/path",
+            "catalogUri": "thrift://hive-metastore.x.com:8970",
+            "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "catalog_test",
+                "hadoop.security.authentication": "kerberos",
+                "hadoop.security.authorization": "true"
+            }
+        },
+        "icebergFilter": {
+            "type": "interval",
+            "filterColumn": "event_time",
+            "intervals": [
+              "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+            ]
+        },
+        "warehouseSource": {
+            "type": "hdfs"
+        }
+      },
+      "inputFormat": {
+        "type": "parquet"
+      }
+  },
+      ...
+},
+...
+```
+
+```json
+...
+        "ioConfig": {
+          "type": "index_parallel",
+          "inputSource": {
+            "type": "iceberg",
+            "tableName": "iceberg_table",
+            "namespace": "iceberg_namespace",
+            "icebergCatalog": {
+              "type": "hive",
+              "warehousePath": "hdfs://warehouse/path",
+              "catalogUri": "thrift://hive-metastore.x.com:8970",
+              "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "default_catalog",
+                "fs.s3a.access.key" : "S3_ACCESS_KEY",
+                "fs.s3a.secret.key" : "S3_SECRET_KEY",
+                "fs.s3a.endpoint" : "S3_API_ENDPOINT"
+              }
+            },
+            "icebergFilter": {
+              "type": "interval",
+              "filterColumn": "event_time",
+              "intervals": [
+                "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+              ]
+            },
+            "warehouseSource": {
+              "type": "s3",
+              "endpointConfig": {
+                "url": "teststore.aws.com",
+                "signingRegion": "us-west-2a"
+              },
+              "clientConfig": {
+                "protocol": "http",
+                "disableChunkedEncoding": true,
+                "enablePathStyleAccess": true,
+                "forceGlobalBucketAccessEnabled": false
+              },
+              "properties": {
+                "accessKeyId": {
+                  "type": "default",
+                  "password": "foo"
+                },
+                "secretAccessKey": {
+                  "type": "default",
+                  "password": "bar"
+                }
+              }
+            }
+          },
+          "inputFormat": {
+            "type": "parquet"
+          }
+        },
+...
+},
+```
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set the value to `iceberg`.|yes|
+|tableName|The iceberg table name configured in the catalog.|yes|
+|namespace|The iceberg namespace associated with the table|yes|
+|icebergFilter|JSON Object used to filter data files within a snapshot when reading|no|
+|icebergCatalog|JSON Object used to define the catalog that manages the configured iceberg table|yes|
+|warehouseSource|JSON Object used to indicate which native input source needs to be used to read the data files from the warehouse|yes|

Review Comment:
   ```suggestion
   |`warehouseSource`|The JSON object that defines the native input source for reading the data files from the warehouse.|Yes|
   ```
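
For example, the two sample specs above read the same table through different native input sources; the HDFS case needs nothing beyond the type:

```json
"warehouseSource": {
  "type": "hdfs"
}
```

The S3 variant in the second sample additionally carries the `endpointConfig`, `clientConfig`, and `properties` blocks of the standard S3 input source.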





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260519149


##########
docs/ingestion/input-sources.md:
##########
@@ -794,6 +794,194 @@ The following is an example of a Combining input source spec:
 ...
 ```
 
+## Iceberg input source
+
+> You need to include the `druid-iceberg-extensions` as an extension to use the Iceberg input source.
+
+The Iceberg input source is used to read data stored in the Iceberg table format. For a given table, this input source scans up to the latest iceberg snapshot from the configured Hive catalog and the underlying live data files will be ingested using the existing input source formats available in Druid.
+
+The Iceberg input source cannot be independent as it relies on the existing input sources to perform the actual read from the Data files.
+For example, if the warehouse associated with an iceberg catalog is on `S3`, please ensure that the [`druid-s3-extensions`](../development/extensions-core/s3.md) extension is also loaded.
+
+Sample specs:
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "iceberg",
+        "tableName": "iceberg_table",
+        "namespace": "iceberg_namespace",
+        "icebergCatalog": {
+            "type": "hive",
+            "warehousePath": "hdfs://warehouse/path",
+            "catalogUri": "thrift://hive-metastore.x.com:8970",
+            "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "catalog_test",
+                "hadoop.security.authentication": "kerberos",
+                "hadoop.security.authorization": "true"
+            }
+        },
+        "icebergFilter": {
+            "type": "interval",
+            "filterColumn": "event_time",
+            "intervals": [
+              "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+            ]
+        },
+        "warehouseSource": {
+            "type": "hdfs"
+        }
+      },
+      "inputFormat": {
+        "type": "parquet"
+      }
+  },
+      ...
+},
+...
+```
+
+```json
+...
+        "ioConfig": {
+          "type": "index_parallel",
+          "inputSource": {
+            "type": "iceberg",
+            "tableName": "iceberg_table",
+            "namespace": "iceberg_namespace",
+            "icebergCatalog": {
+              "type": "hive",
+              "warehousePath": "hdfs://warehouse/path",
+              "catalogUri": "thrift://hive-metastore.x.com:8970",
+              "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "default_catalog",
+                "fs.s3a.access.key" : "S3_ACCESS_KEY",
+                "fs.s3a.secret.key" : "S3_SECRET_KEY",
+                "fs.s3a.endpoint" : "S3_API_ENDPOINT"
+              }
+            },
+            "icebergFilter": {
+              "type": "interval",
+              "filterColumn": "event_time",
+              "intervals": [
+                "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+              ]
+            },
+            "warehouseSource": {
+              "type": "s3",
+              "endpointConfig": {
+                "url": "teststore.aws.com",
+                "signingRegion": "us-west-2a"
+              },
+              "clientConfig": {
+                "protocol": "http",
+                "disableChunkedEncoding": true,
+                "enablePathStyleAccess": true,
+                "forceGlobalBucketAccessEnabled": false
+              },
+              "properties": {
+                "accessKeyId": {
+                  "type": "default",
+                  "password": "foo"
+                },
+                "secretAccessKey": {
+                  "type": "default",
+                  "password": "bar"
+                }
+              }
+            }
+          },
+          "inputFormat": {
+            "type": "parquet"
+          }
+        },
+...
+},
+```
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set the value to `iceberg`.|yes|
+|tableName|The iceberg table name configured in the catalog.|yes|
+|namespace|The iceberg namespace associated with the table|yes|
+|icebergFilter|JSON Object used to filter data files within a snapshot when reading|no|
+|icebergCatalog|JSON Object used to define the catalog that manages the configured iceberg table|yes|
+|warehouseSource|JSON Object used to indicate which native input source needs to be used to read the data files from the warehouse|yes|
+
+Catalog Object:

Review Comment:
   ```suggestion
   ### Catalog object
   ```





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260525586


##########
docs/ingestion/input-sources.md:
##########
@@ -794,6 +794,194 @@ The following is an example of a Combining input source spec:
 ...
 ```
 
+## Iceberg input source
+
+> You need to include the `druid-iceberg-extensions` as an extension to use the Iceberg input source.
+
+The Iceberg input source is used to read data stored in the Iceberg table format. For a given table, this input source scans up to the latest iceberg snapshot from the configured Hive catalog and the underlying live data files will be ingested using the existing input source formats available in Druid.
+
+The Iceberg input source cannot be independent as it relies on the existing input sources to perform the actual read from the Data files.
+For example, if the warehouse associated with an iceberg catalog is on `S3`, please ensure that the [`druid-s3-extensions`](../development/extensions-core/s3.md) extension is also loaded.
+
+Sample specs:
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "iceberg",
+        "tableName": "iceberg_table",
+        "namespace": "iceberg_namespace",
+        "icebergCatalog": {
+            "type": "hive",
+            "warehousePath": "hdfs://warehouse/path",
+            "catalogUri": "thrift://hive-metastore.x.com:8970",
+            "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "catalog_test",
+                "hadoop.security.authentication": "kerberos",
+                "hadoop.security.authorization": "true"
+            }
+        },
+        "icebergFilter": {
+            "type": "interval",
+            "filterColumn": "event_time",
+            "intervals": [
+              "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+            ]
+        },
+        "warehouseSource": {
+            "type": "hdfs"
+        }
+      },
+      "inputFormat": {
+        "type": "parquet"
+      }
+  },
+      ...
+},
+...
+```
+
+```json
+...
+        "ioConfig": {
+          "type": "index_parallel",
+          "inputSource": {
+            "type": "iceberg",
+            "tableName": "iceberg_table",
+            "namespace": "iceberg_namespace",
+            "icebergCatalog": {
+              "type": "hive",
+              "warehousePath": "hdfs://warehouse/path",
+              "catalogUri": "thrift://hive-metastore.x.com:8970",
+              "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "default_catalog",
+                "fs.s3a.access.key" : "S3_ACCESS_KEY",
+                "fs.s3a.secret.key" : "S3_SECRET_KEY",
+                "fs.s3a.endpoint" : "S3_API_ENDPOINT"
+              }
+            },
+            "icebergFilter": {
+              "type": "interval",
+              "filterColumn": "event_time",
+              "intervals": [
+                "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+              ]
+            },
+            "warehouseSource": {
+              "type": "s3",
+              "endpointConfig": {
+                "url": "teststore.aws.com",
+                "signingRegion": "us-west-2a"
+              },
+              "clientConfig": {
+                "protocol": "http",
+                "disableChunkedEncoding": true,
+                "enablePathStyleAccess": true,
+                "forceGlobalBucketAccessEnabled": false
+              },
+              "properties": {
+                "accessKeyId": {
+                  "type": "default",
+                  "password": "foo"
+                },
+                "secretAccessKey": {
+                  "type": "default",
+                  "password": "bar"
+                }
+              }
+            }
+          },
+          "inputFormat": {
+            "type": "parquet"
+          }
+        },
+...
+},
+```
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set the value to `iceberg`.|yes|
+|tableName|The iceberg table name configured in the catalog.|yes|
+|namespace|The iceberg namespace associated with the table|yes|
+|icebergFilter|JSON Object used to filter data files within a snapshot when reading|no|
+|icebergCatalog|JSON Object used to define the catalog that manages the configured iceberg table|yes|
+|warehouseSource|JSON Object used to indicate which native input source needs to be used to read the data files from the warehouse|yes|
+
+Catalog Object:
+
+There are two supported catalog types: `local` and `hive`
+
+Local catalog:
+
+|Property|Description|Required|

Review Comment:
   ```suggestion
   |**Property**|**Description**|**Required**|
   ```
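
A minimal sketch of a `local` catalog object, following the property table above; the warehouse path is a placeholder rather than a value from this PR:

```json
"icebergCatalog": {
  "type": "local",
  "warehousePath": "/path/to/local/warehouse"
}
```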





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260343275


##########
docs/development/extensions-contrib/iceberg.md:
##########
@@ -0,0 +1,121 @@
+---
+id: iceberg 
+title: "Iceberg"
+---
+
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+
+## Iceberg Ingest Extension
+
+This extension provides [IcebergInputSource](../../ingestion/input-sources.md#iceberg-input-source) which enables ingestion of data stored in the Iceberg table format into Druid.
+
+Apache Iceberg is an open table format for huge analytic datasets. Even though iceberg manages most of its metadata in metadata files in the object storage, it is still dependent on a metastore for managing a certain amount of metadata.
+These metastores are defined as Iceberg catalogs and this extension supports connecting to the following catalog types:
+* Hive metastore catalog
+* Local catalog
+
+Support for AWS Glue and REST based catalogs are not available yet.
+
+For a given catalog, iceberg table name and filters, the IcebergInputSource works by reading the table from the catalog, applying the filters and extracting all the underlying live data files up to the latest snapshot.
+The data files are in either Parquet, ORC or Avro formats and all of these have InputFormat support in Druid. The data files typically reside in a warehouse location which could be in HDFS, S3 or the local filesystem.
+This extension relies on the existing InputSource connectors in Druid to read the data files from the warehouse. Therefore, the IcebergInputSource can be considered as an intermediate InputSource which provides the file paths for other InputSource implementations.
+
+### Load the Iceberg Ingest extension

Review Comment:
   You can completely omit this section if you roll it into the introduction.





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260343950


##########
docs/development/extensions-contrib/iceberg.md:
##########
@@ -0,0 +1,121 @@
+---
+id: iceberg 
+title: "Iceberg"
+---
+
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+
+## Iceberg Ingest Extension
+
+This extension provides [IcebergInputSource](../../ingestion/input-sources.md#iceberg-input-source) which enables ingestion of data stored in the Iceberg table format into Druid.
+
+Apache Iceberg is an open table format for huge analytic datasets. Even though iceberg manages most of its metadata in metadata files in the object storage, it is still dependent on a metastore for managing a certain amount of metadata.
+These metastores are defined as Iceberg catalogs and this extension supports connecting to the following catalog types:
+* Hive metastore catalog
+* Local catalog
+
+Support for AWS Glue and REST based catalogs are not available yet.

Review Comment:
   ```suggestion
   Druid does not support AWS Glue and REST based Iceberg catalogs.
   ```





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260471570


##########
docs/development/extensions-contrib/iceberg.md:
##########
@@ -0,0 +1,121 @@
+---
+id: iceberg 
+title: "Iceberg"
+---
+
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+
+## Iceberg Ingest Extension
+
+This extension provides [IcebergInputSource](../../ingestion/input-sources.md#iceberg-input-source) which enables ingestion of data stored in the Iceberg table format into Druid.
+
+Apache Iceberg is an open table format for huge analytic datasets. Even though iceberg manages most of its metadata in metadata files in the object storage, it is still dependent on a metastore for managing a certain amount of metadata.
+These metastores are defined as Iceberg catalogs and this extension supports connecting to the following catalog types:
+* Hive metastore catalog
+* Local catalog
+
+Support for AWS Glue and REST based catalogs are not available yet.
+
+For a given catalog, iceberg table name and filters, the IcebergInputSource works by reading the table from the catalog, applying the filters and extracting all the underlying live data files up to the latest snapshot.
+The data files are in either Parquet, ORC or Avro formats and all of these have InputFormat support in Druid. The data files typically reside in a warehouse location which could be in HDFS, S3 or the local filesystem.
+This extension relies on the existing InputSource connectors in Druid to read the data files from the warehouse. Therefore, the IcebergInputSource can be considered as an intermediate InputSource which provides the file paths for other InputSource implementations.
+
+### Load the Iceberg Ingest extension
+
+To use the iceberg extension, add the `druid-iceberg-extensions` to the list of loaded extensions. See [Loading extensions](../../configuration/extensions.md#loading-extensions) for more information.
+
+
+### Hive Metastore catalog
+
+For Druid to seamlessly talk to the Hive Metastore, ensure that the Hive specific configuration files such as `hive-site.xml` and `core-site.xml` are available in the Druid classpath for peon processes.  
+Hive specific properties can also be specified under the `catalogProperties` object in the ingestion spec. 
+
+Hive metastore catalogs can be associated with different types of warehouses, but this extension presently only supports HDFS and S3 warehouse directories.
+
+#### Reading from HDFS warehouse 
+
+Ensure that the extension `druid-hdfs-storage` is loaded. The data file paths are extracted from the Hive metastore catalog and [HDFS input source](../../ingestion/input-sources.md#hdfs-input-source) is used to ingest these files.
+The `warehouseSource` type in the ingestion spec should be `hdfs`.
+
+If the Hive metastore supports Kerberos authentication, the following properties will be required in the `catalogProperties`:
+
+```json
+"catalogProperties": {
+  "principal": "krb_principal",
+  "keytab": "/path/to/keytab"
+}
+```
+Only Kerberos based authentication is supported as of now.

Review Comment:
   If Kerberos authentication is the only authentication method supported at this time, consider rewriting line 57. Starting the sentence with "if" makes it sound like there are other supported authentication methods.
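
Combining the two fragments that appear in this PR, a Kerberos-secured HDFS setup would presumably carry both the keytab details and the Hadoop security switches in `catalogProperties` (all values are the placeholders used elsewhere in these docs):

```json
"catalogProperties": {
  "principal": "krb_principal",
  "keytab": "/path/to/keytab",
  "hadoop.security.authentication": "kerberos",
  "hadoop.security.authorization": "true"
}
```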





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260527322


##########
docs/ingestion/input-sources.md:
##########
@@ -794,6 +794,194 @@ The following is an example of a Combining input source spec:
 ...
 ```
 
+## Iceberg input source
+
+> You need to include the `druid-iceberg-extensions` as an extension to use the Iceberg input source.
+
+The Iceberg input source is used to read data stored in the Iceberg table format. For a given table, this input source scans up to the latest iceberg snapshot from the configured Hive catalog and the underlying live data files will be ingested using the existing input source formats available in Druid.
+
+The Iceberg input source cannot be independent as it relies on the existing input sources to perform the actual read from the Data files.
+For example, if the warehouse associated with an iceberg catalog is on `S3`, please ensure that the [`druid-s3-extensions`](../development/extensions-core/s3.md) extension is also loaded.
+
+Sample specs:
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "iceberg",
+        "tableName": "iceberg_table",
+        "namespace": "iceberg_namespace",
+        "icebergCatalog": {
+            "type": "hive",
+            "warehousePath": "hdfs://warehouse/path",
+            "catalogUri": "thrift://hive-metastore.x.com:8970",
+            "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "catalog_test",
+                "hadoop.security.authentication": "kerberos",
+                "hadoop.security.authorization": "true"
+            }
+        },
+        "icebergFilter": {
+            "type": "interval",
+            "filterColumn": "event_time",
+            "intervals": [
+              "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+            ]
+        },
+        "warehouseSource": {
+            "type": "hdfs"
+        }
+      },
+      "inputFormat": {
+        "type": "parquet"
+      }
+  },
+      ...
+},
+...
+```
+
+```json
+...
+        "ioConfig": {
+          "type": "index_parallel",
+          "inputSource": {
+            "type": "iceberg",
+            "tableName": "iceberg_table",
+            "namespace": "iceberg_namespace",
+            "icebergCatalog": {
+              "type": "hive",
+              "warehousePath": "hdfs://warehouse/path",
+              "catalogUri": "thrift://hive-metastore.x.com:8970",
+              "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "default_catalog",
+                "fs.s3a.access.key" : "S3_ACCESS_KEY",
+                "fs.s3a.secret.key" : "S3_SECRET_KEY",
+                "fs.s3a.endpoint" : "S3_API_ENDPOINT"
+              }
+            },
+            "icebergFilter": {
+              "type": "interval",
+              "filterColumn": "event_time",
+              "intervals": [
+                "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+              ]
+            },
+            "warehouseSource": {
+              "type": "s3",
+              "endpointConfig": {
+                "url": "teststore.aws.com",
+                "signingRegion": "us-west-2a"
+              },
+              "clientConfig": {
+                "protocol": "http",
+                "disableChunkedEncoding": true,
+                "enablePathStyleAccess": true,
+                "forceGlobalBucketAccessEnabled": false
+              },
+              "properties": {
+                "accessKeyId": {
+                  "type": "default",
+                  "password": "foo"
+                },
+                "secretAccessKey": {
+                  "type": "default",
+                  "password": "bar"
+                }
+              },
+            }
+          },
+          "inputFormat": {
+            "type": "parquet"
+          }
+        },
+...
+},
+```
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set the value to `iceberg`.|yes|
+|tableName|The iceberg table name configured in the catalog.|yes|
+|namespace|The iceberg namespace associated with the table|yes|
+|icebergFilter|JSON Object used to filter data files within a snapshot when reading|no|
+|icebergCatalog|JSON Object used to define the catalog that manages the configured iceberg table|yes|
+|warehouseSource|JSON Object used to indicate which native input source needs to be used to read the data files from the warehouse|yes|
+
+Catalog Object:
+
+There are two supported catalog types: `local` and `hive`
+
+Local catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `local`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+Hive Catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `hive`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogUri|The URI associated with the hive catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|

Review Comment:
   ```suggestion
   |`catalogProperties`|Map of any additional properties that needs to be attached to the catalog.|No|
   ```
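
As the second sample spec above shows, `catalogProperties` is also where warehouse-level settings such as S3A credentials are passed when the Hive catalog points at an S3 warehouse:

```json
"catalogProperties": {
  "fs.s3a.access.key": "S3_ACCESS_KEY",
  "fs.s3a.secret.key": "S3_SECRET_KEY",
  "fs.s3a.endpoint": "S3_API_ENDPOINT"
}
```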



##########
docs/ingestion/input-sources.md:
##########
@@ -794,6 +794,194 @@ The following is an example of a Combining input source spec:
 ...
 ```
 
+## Iceberg input source
+
+> You need to include the `druid-iceberg-extensions` as an extension to use the Iceberg input source.
+
+The Iceberg input source is used to read data stored in the Iceberg table format. For a given table, this input source scans up to the latest iceberg snapshot from the configured Hive catalog and the underlying live data files will be ingested using the existing input source formats available in Druid.
+
+The Iceberg input source cannot be independent as it relies on the existing input sources to perform the actual read from the Data files.
+For example, if the warehouse associated with an iceberg catalog is on `S3`, please ensure that the [`druid-s3-extensions`](../development/extensions-core/s3.md) extension is also loaded.
+
+Sample specs:
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "iceberg",
+        "tableName": "iceberg_table",
+        "namespace": "iceberg_namespace",
+        "icebergCatalog": {
+            "type": "hive",
+            "warehousePath": "hdfs://warehouse/path",
+            "catalogUri": "thrift://hive-metastore.x.com:8970",
+            "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "catalog_test",
+                "hadoop.security.authentication": "kerberos",
+                "hadoop.security.authorization": "true"
+            }
+        },
+        "icebergFilter": {
+            "type": "interval",
+            "filterColumn": "event_time",
+            "intervals": [
+              "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+            ]
+        },
+        "warehouseSource": {
+            "type": "hdfs"
+        }
+      },
+      "inputFormat": {
+        "type": "parquet"
+      }
+  },
+      ...
+},
+...
+```
+
+```json
+...
+        "ioConfig": {
+          "type": "index_parallel",
+          "inputSource": {
+            "type": "iceberg",
+            "tableName": "iceberg_table",
+            "namespace": "iceberg_namespace",
+            "icebergCatalog": {
+              "type": "hive",
+              "warehousePath": "hdfs://warehouse/path",
+              "catalogUri": "thrift://hive-metastore.x.com:8970",
+              "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "default_catalog",
+                "fs.s3a.access.key" : "S3_ACCESS_KEY",
+                "fs.s3a.secret.key" : "S3_SECRET_KEY",
+                "fs.s3a.endpoint" : "S3_API_ENDPOINT"
+              }
+            },
+            "icebergFilter": {
+              "type": "interval",
+              "filterColumn": "event_time",
+              "intervals": [
+                "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+              ]
+            },
+            "warehouseSource": {
+              "type": "s3",
+              "endpointConfig": {
+                "url": "teststore.aws.com",
+                "signingRegion": "us-west-2a"
+              },
+              "clientConfig": {
+                "protocol": "http",
+                "disableChunkedEncoding": true,
+                "enablePathStyleAccess": true,
+                "forceGlobalBucketAccessEnabled": false
+              },
+              "properties": {
+                "accessKeyId": {
+                  "type": "default",
+                  "password": "foo"
+                },
+                "secretAccessKey": {
+                  "type": "default",
+                  "password": "bar"
+                }
+              }
+            }
+          },
+          "inputFormat": {
+            "type": "parquet"
+          }
+        },
+...
+},
+```
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set the value to `iceberg`.|yes|
+|tableName|The iceberg table name configured in the catalog.|yes|
+|namespace|The iceberg namespace associated with the table|yes|
+|icebergFilter|JSON Object used to filter data files within a snapshot when reading|no|
+|icebergCatalog|JSON Object used to define the catalog that manages the configured iceberg table|yes|
+|warehouseSource|JSON Object used to indicate which native input source needs to be used to read the data files from the warehouse|yes|
+
+Catalog Object:
+
+There are two supported catalog types: `local` and `hive`
+
+Local catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `local`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+Hive Catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `hive`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogUri|The URI associated with the hive catalog|yes|

Review Comment:
   ```suggestion
   |`catalogUri`|The URI associated with the catalog.|Yes|
   ```
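
Putting the Hive catalog fields together, the sample spec above reduces to a catalog object of this shape:

```json
"icebergCatalog": {
  "type": "hive",
  "warehousePath": "hdfs://warehouse/path",
  "catalogUri": "thrift://hive-metastore.x.com:8970",
  "catalogProperties": {
    "hive.metastore.sasl.enabled": "true"
  }
}
```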





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260538864


##########
docs/ingestion/input-sources.md:
##########
@@ -794,6 +794,194 @@ The following is an example of a Combining input source spec:
 ...
 ```
 
+## Iceberg input source
+
+> You need to include the `druid-iceberg-extensions` as an extension to use the Iceberg input source.
+
+The Iceberg input source is used to read data stored in the Iceberg table format. For a given table, this input source scans up to the latest iceberg snapshot from the configured Hive catalog and the underlying live data files will be ingested using the existing input source formats available in Druid.
+
+The Iceberg input source cannot be independent as it relies on the existing input sources to perform the actual read from the Data files.
+For example, if the warehouse associated with an iceberg catalog is on `S3`, please ensure that the [`druid-s3-extensions`](../development/extensions-core/s3.md) extension is also loaded.
+
+Sample specs:
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "iceberg",
+        "tableName": "iceberg_table",
+        "namespace": "iceberg_namespace",
+        "icebergCatalog": {
+            "type": "hive",
+            "warehousePath": "hdfs://warehouse/path",
+            "catalogUri": "thrift://hive-metastore.x.com:8970",
+            "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "catalog_test",
+                "hadoop.security.authentication": "kerberos",
+                "hadoop.security.authorization": "true"
+            }
+        },
+        "icebergFilter": {
+            "type": "interval",
+            "filterColumn": "event_time",
+            "intervals": [
+              "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+            ]
+        },
+        "warehouseSource": {
+            "type": "hdfs"
+        }
+      },
+      "inputFormat": {
+        "type": "parquet"
+      }
+  },
+      ...
+},
+...
+```
+
+```json
+...
+        "ioConfig": {
+          "type": "index_parallel",
+          "inputSource": {
+            "type": "iceberg",
+            "tableName": "iceberg_table",
+            "namespace": "iceberg_namespace",
+            "icebergCatalog": {
+              "type": "hive",
+              "warehousePath": "hdfs://warehouse/path",
+              "catalogUri": "thrift://hive-metastore.x.com:8970",
+              "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "default_catalog",
+                "fs.s3a.access.key" : "S3_ACCESS_KEY",
+                "fs.s3a.secret.key" : "S3_SECRET_KEY",
+                "fs.s3a.endpoint" : "S3_API_ENDPOINT"
+              }
+            },
+            "icebergFilter": {
+              "type": "interval",
+              "filterColumn": "event_time",
+              "intervals": [
+                "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+              ]
+            },
+            "warehouseSource": {
+              "type": "s3",
+              "endpointConfig": {
+                "url": "teststore.aws.com",
+                "signingRegion": "us-west-2a"
+              },
+              "clientConfig": {
+                "protocol": "http",
+                "disableChunkedEncoding": true,
+                "enablePathStyleAccess": true,
+                "forceGlobalBucketAccessEnabled": false
+              },
+              "properties": {
+                "accessKeyId": {
+                  "type": "default",
+                  "password": "foo"
+                },
+                "secretAccessKey": {
+                  "type": "default",
+                  "password": "bar"
+                }
+              },
+            }
+          },
+          "inputFormat": {
+            "type": "parquet"
+          }
+        },
+...
+},
+```
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set the value to `iceberg`.|yes|
+|tableName|The iceberg table name configured in the catalog.|yes|
+|namespace|The iceberg namespace associated with the table|yes|
+|icebergFilter|JSON Object used to filter data files within a snapshot when reading|no|
+|icebergCatalog|JSON Object used to define the catalog that manages the configured iceberg table|yes|
+|warehouseSource|JSON Object used to indicate which native input source needs to be used to read the data files from the warehouse|yes|
+
+Catalog Object:
+
+There are two supported catalog types: `local` and `hive`
+
+Local catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `local`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+Hive Catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `hive`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogUri|The URI associated with the hive catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+IcebergFilter Object:
+
+This input source provides filters: `and` , `equals` , `interval` and `or`. These filters can be used to filter out data files from a snapshot, thereby reducing the number of files Druid has to ingest.
+
+`equals` Filter:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `equals`.|yes|
+|filterColumn|The column name from the iceberg table schema based on which filtering needs to happen|yes|
+|filterValue|The value to filter on|yes|
+
+`interval` Filter:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `interval`.|yes|
+|filterColumn|The column name from the iceberg table schema based on which filtering needs to happen|yes|
+|intervals|A JSON array containing ISO-8601 interval strings. This defines the time ranges to filter on. The start interval is inclusive and the end interval is exclusive. |yes|
+
+`and` Filter:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `and`.|yes|
+|filters|List of iceberg filters that needs to be AND-ed|yes|
+
+`or` Filter:
+
+|Property|Description|Required|

Review Comment:
   ```suggestion
   |**Property**|**Description**|**Required**|
   ```
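
For reference, an `equals` filter built from the property table above looks like the following; the column name and value are placeholders, not values from this PR:

```json
"icebergFilter": {
  "type": "equals",
  "filterColumn": "region",
  "filterValue": "us-east-1"
}
```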



##########
docs/ingestion/input-sources.md:
##########
@@ -794,6 +794,194 @@ The following is an example of a Combining input source spec:
 ...
 ```
 
+## Iceberg input source
+
+> You need to include the `druid-iceberg-extensions` as an extension to use the Iceberg input source.
+
+The Iceberg input source is used to read data stored in the Iceberg table format. For a given table, this input source scans up to the latest iceberg snapshot from the configured Hive catalog and the underlying live data files will be ingested using the existing input source formats available in Druid.
+
+The Iceberg input source cannot be independent as it relies on the existing input sources to perform the actual read from the Data files.
+For example, if the warehouse associated with an iceberg catalog is on `S3`, please ensure that the [`druid-s3-extensions`](../development/extensions-core/s3.md) extension is also loaded.
+
+Sample specs:
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "iceberg",
+        "tableName": "iceberg_table",
+        "namespace": "iceberg_namespace",
+        "icebergCatalog": {
+            "type": "hive",
+            "warehousePath": "hdfs://warehouse/path",
+            "catalogUri": "thrift://hive-metastore.x.com:8970",
+            "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "catalog_test",
+                "hadoop.security.authentication": "kerberos",
+                "hadoop.security.authorization": "true"
+            }
+        },
+        "icebergFilter": {
+            "type": "interval",
+            "filterColumn": "event_time",
+            "intervals": [
+              "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+            ]
+        },
+        "warehouseSource": {
+            "type": "hdfs"
+        }
+      },
+      "inputFormat": {
+        "type": "parquet"
+      }
+  },
+      ...
+},
+...
+```
+
+```json
+...
+        "ioConfig": {
+          "type": "index_parallel",
+          "inputSource": {
+            "type": "iceberg",
+            "tableName": "iceberg_table",
+            "namespace": "iceberg_namespace",
+            "icebergCatalog": {
+              "type": "hive",
+              "warehousePath": "hdfs://warehouse/path",
+              "catalogUri": "thrift://hive-metastore.x.com:8970",
+              "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "default_catalog",
+                "fs.s3a.access.key" : "S3_ACCESS_KEY",
+                "fs.s3a.secret.key" : "S3_SECRET_KEY",
+                "fs.s3a.endpoint" : "S3_API_ENDPOINT"
+              }
+            },
+            "icebergFilter": {
+              "type": "interval",
+              "filterColumn": "event_time",
+              "intervals": [
+                "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+              ]
+            },
+            "warehouseSource": {
+              "type": "s3",
+              "endpointConfig": {
+                "url": "teststore.aws.com",
+                "signingRegion": "us-west-2a"
+              },
+              "clientConfig": {
+                "protocol": "http",
+                "disableChunkedEncoding": true,
+                "enablePathStyleAccess": true,
+                "forceGlobalBucketAccessEnabled": false
+              },
+              "properties": {
+                "accessKeyId": {
+                  "type": "default",
+                  "password": "foo"
+                },
+                "secretAccessKey": {
+                  "type": "default",
+                  "password": "bar"
+                }
+              }
+            }
+          },
+          "inputFormat": {
+            "type": "parquet"
+          }
+        },
+...
+},
+```
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set the value to `iceberg`.|yes|
+|tableName|The iceberg table name configured in the catalog.|yes|
+|namespace|The iceberg namespace associated with the table|yes|
+|icebergFilter|JSON Object used to filter data files within a snapshot when reading|no|
+|icebergCatalog|JSON Object used to define the catalog that manages the configured iceberg table|yes|
+|warehouseSource|JSON Object used to indicate which native input source needs to be used to read the data files from the warehouse|yes|
+
+Catalog Object:
+
+There are two supported catalog types: `local` and `hive`
+
+Local catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `local`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+Hive Catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `hive`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogUri|The URI associated with the hive catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+IcebergFilter Object:
+
+This input source provides filters: `and` , `equals` , `interval` and `or`. These filters can be used to filter out data files from a snapshot, thereby reducing the number of files Druid has to ingest.
+
+`equals` Filter:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `equals`.|yes|
+|filterColumn|The column name from the iceberg table schema based on which filtering needs to happen|yes|
+|filterValue|The value to filter on|yes|
+
+`interval` Filter:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `interval`.|yes|
+|filterColumn|The column name from the iceberg table schema based on which filtering needs to happen|yes|
+|intervals|A JSON array containing ISO-8601 interval strings. This defines the time ranges to filter on. The start interval is inclusive and the end interval is exclusive. |yes|
+
+`and` Filter:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `and`.|yes|
+|filters|List of iceberg filters that needs to be AND-ed|yes|
+
+`or` Filter:

Review Comment:
   ```suggestion
   The `or` filter:
   ```
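
And a sketch of a composite filter, assuming the `and` filter simply wraps the child filter shapes defined above (the interval is the one used in the sample specs; the `equals` values are placeholders):

```json
"icebergFilter": {
  "type": "and",
  "filters": [
    {
      "type": "interval",
      "filterColumn": "event_time",
      "intervals": ["2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"]
    },
    {
      "type": "equals",
      "filterColumn": "region",
      "filterValue": "us-east-1"
    }
  ]
}
```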





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260542481


##########
docs/ingestion/input-sources.md:
##########
@@ -794,6 +794,194 @@ The following is an example of a Combining input source spec:
 ...
 ```
 
+## Iceberg input source
+
+> You need to include the `druid-iceberg-extensions` as an extension to use the Iceberg input source.
+
+The Iceberg input source is used to read data stored in the Iceberg table format. For a given table, this input source scans up to the latest iceberg snapshot from the configured Hive catalog and the underlying live data files will be ingested using the existing input source formats available in Druid.
+
+The Iceberg input source cannot be independent as it relies on the existing input sources to perform the actual read from the Data files.
+For example, if the warehouse associated with an iceberg catalog is on `S3`, please ensure that the [`druid-s3-extensions`](../development/extensions-core/s3.md) extension is also loaded.
+
+Sample specs:
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "iceberg",
+        "tableName": "iceberg_table",
+        "namespace": "iceberg_namespace",
+        "icebergCatalog": {
+            "type": "hive",
+            "warehousePath": "hdfs://warehouse/path",
+            "catalogUri": "thrift://hive-metastore.x.com:8970",
+            "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "catalog_test",
+                "hadoop.security.authentication": "kerberos",
+                "hadoop.security.authorization": "true"
+            }
+        },
+        "icebergFilter": {
+            "type": "interval",
+            "filterColumn": "event_time",
+            "intervals": [
+              "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+            ]
+        },
+        "warehouseSource": {
+            "type": "hdfs"
+        }
+      },
+      "inputFormat": {
+        "type": "parquet"
+      }
+  },
+      ...
+},
+...
+```
+
+```json
+...
+        "ioConfig": {
+          "type": "index_parallel",
+          "inputSource": {
+            "type": "iceberg",
+            "tableName": "iceberg_table",
+            "namespace": "iceberg_namespace",
+            "icebergCatalog": {
+              "type": "hive",
+              "warehousePath": "hdfs://warehouse/path",
+              "catalogUri": "thrift://hive-metastore.x.com:8970",
+              "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "default_catalog",
+                "fs.s3a.access.key" : "S3_ACCESS_KEY",
+                "fs.s3a.secret.key" : "S3_SECRET_KEY",
+                "fs.s3a.endpoint" : "S3_API_ENDPOINT"
+              }
+            },
+            "icebergFilter": {
+              "type": "interval",
+              "filterColumn": "event_time",
+              "intervals": [
+                "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+              ]
+            },
+            "warehouseSource": {
+              "type": "s3",
+              "endpointConfig": {
+                "url": "teststore.aws.com",
+                "signingRegion": "us-west-2a"
+              },
+              "clientConfig": {
+                "protocol": "http",
+                "disableChunkedEncoding": true,
+                "enablePathStyleAccess": true,
+                "forceGlobalBucketAccessEnabled": false
+              },
+              "properties": {
+                "accessKeyId": {
+                  "type": "default",
+                  "password": "foo"
+                },
+                "secretAccessKey": {
+                  "type": "default",
+                  "password": "bar"
+                }
+              }
+            }
+          },
+          "inputFormat": {
+            "type": "parquet"
+          }
+        },
+...
+},
+```
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set the value to `iceberg`.|yes|
+|tableName|The iceberg table name configured in the catalog.|yes|
+|namespace|The iceberg namespace associated with the table|yes|
+|icebergFilter|JSON Object used to filter data files within a snapshot when reading|no|
+|icebergCatalog|JSON Object used to define the catalog that manages the configured iceberg table|yes|
+|warehouseSource|JSON Object used to indicate which native input source needs to be used to read the data files from the warehouse|yes|
+
+Catalog Object:
+
+There are two supported catalog types: `local` and `hive`.
+
+Local catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `local`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogProperties|Map of any additional properties that need to be attached to the catalog|no|
+
+Hive Catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `hive`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogUri|The URI associated with the hive catalog|yes|
+|catalogProperties|Map of any additional properties that need to be attached to the catalog|no|
+
+IcebergFilter Object:
+
+This input source provides filters: `and`, `equals`, `interval`, and `or`. These filters can be used to filter out data files from a snapshot, thereby reducing the number of files Druid has to ingest.
+
+`equals` Filter:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `equals`.|yes|
+|filterColumn|The column name from the iceberg table schema to filter on|yes|
+|filterValue|The value to filter on|yes|
+
+`interval` Filter:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `interval`.|yes|
+|filterColumn|The column name from the iceberg table schema to filter on|yes|
+|intervals|A JSON array containing ISO-8601 interval strings. This defines the time ranges to filter on. The start interval is inclusive and the end interval is exclusive.|yes|
+
+`and` Filter:

Review Comment:
   `and`, `or`, and `not` filters all have the same properties. Consider not using tables to present this information or combine `and`, `or`, and `not` into one table.
   
   Also, the definition for the `filters` property is confusing. What exactly do we pass into that property? A column name, a filter name, etc?
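For reference, `filters` takes a list of nested iceberg filter objects (any of `equals`, `interval`, `and`, `or`, `not`). Based on the property tables above, a composed filter could look like the following sketch; the column names and the equals value are illustrative, not part of the PR:

```json
"icebergFilter": {
  "type": "and",
  "filters": [
    {
      "type": "equals",
      "filterColumn": "region",
      "filterValue": "us-east-1"
    },
    {
      "type": "interval",
      "filterColumn": "event_time",
      "intervals": [
        "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
      ]
    }
  ]
}
```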





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "abhishekagarwal87 (via GitHub)" <gi...@apache.org>.
abhishekagarwal87 commented on PR #14329:
URL: https://github.com/apache/druid/pull/14329#issuecomment-1639287004

   Sounds good. I just merged your PR. 




[GitHub] [druid] a2l007 commented on a diff in pull request #14329: Extension to read and ingest iceberg data files

Posted by "a2l007 (via GitHub)" <gi...@apache.org>.
a2l007 commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1225897320


##########
extensions-contrib/druid-iceberg-extensions/src/main/java/org/apache/druid/iceberg/input/IcebergCatalog.java:
##########
@@ -0,0 +1,99 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.iceberg.input;
+
+import com.fasterxml.jackson.annotation.JsonTypeInfo;
+import org.apache.druid.data.input.InputFormat;
+import org.apache.druid.iceberg.filter.IcebergFilter;
+import org.apache.druid.java.util.common.IAE;
+import org.apache.druid.java.util.common.RE;
+import org.apache.druid.java.util.common.logger.Logger;
+import org.apache.iceberg.BaseMetastoreCatalog;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.TableScan;
+import org.apache.iceberg.catalog.Catalog;
+import org.apache.iceberg.catalog.Namespace;
+import org.apache.iceberg.catalog.TableIdentifier;
+import org.apache.iceberg.io.CloseableIterable;
+
+import java.util.ArrayList;
+import java.util.List;
+
+/*
+ * Druid wrapper for an iceberg catalog.
+ * The configured catalog is used to load the specified iceberg table and retrieve the underlying live data files up to the latest snapshot.
+ * This does not perform any projections on the table yet, therefore all the underlying columns will be retrieved from the data files.

Review Comment:
   Yes, we create an iceberg table scan and feed it the set of filters before the plan files are identified. Therefore while the files are being planned, it can prune out the list based on the filters provided.
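A minimal sketch of that flow, reusing the Iceberg API calls already visible in this class (catalog, table identifier, and filter are assumed to be set up as in `extractSnapshotDataFiles`):

```java
// The filter is pushed into the scan before planFiles(), so Iceberg can prune
// data files while planning rather than after the fact.
TableScan tableScan = catalog.loadTable(icebergTableIdentifier).newScan();
if (icebergFilter != null) {
  tableScan = icebergFilter.filter(tableScan); // delegates to tableScan.filter(expression)
}
CloseableIterable<FileScanTask> tasks = tableScan.planFiles();
CloseableIterable.transform(tasks, FileScanTask::file)
                 .forEach(dataFile -> dataFilePaths.add(dataFile.path().toString()));
```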



##########
extensions-contrib/druid-iceberg-extensions/src/main/java/org/apache/druid/iceberg/input/HiveIcebergCatalog.java:
##########
@@ -0,0 +1,136 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.iceberg.input;
+
+import com.fasterxml.jackson.annotation.JacksonInject;
+import com.fasterxml.jackson.annotation.JsonCreator;
+import com.fasterxml.jackson.annotation.JsonProperty;
+import com.google.common.base.Preconditions;
+import com.google.common.base.Strings;
+import org.apache.druid.iceberg.guice.HiveConf;
+import org.apache.druid.java.util.common.ISE;
+import org.apache.druid.java.util.common.logger.Logger;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.security.UserGroupInformation;
+import org.apache.iceberg.BaseMetastoreCatalog;
+import org.apache.iceberg.hive.HiveCatalog;
+
+import javax.annotation.Nullable;
+import java.io.IOException;
+import java.util.HashMap;
+import java.util.Map;
+
+/**
+ * Hive Metastore specific implementation of iceberg catalog.
+ * Kerberos authentication is performed if the credentials are provided in the catalog properties
+ */
+public class HiveIcebergCatalog extends IcebergCatalog
+{
+  public static final String TYPE_KEY = "hive";
+
+  @JsonProperty
+  private String warehousePath;
+
+  @JsonProperty
+  private String catalogUri;
+
+  @JsonProperty
+  private Map<String, String> catalogProperties;
+
+  private final Configuration configuration;
+
+  private BaseMetastoreCatalog hiveCatalog;
+
+  private static final Logger log = new Logger(HiveIcebergCatalog.class);
+
+  @JsonCreator
+  public HiveIcebergCatalog(
+      @JsonProperty("warehousePath") String warehousePath,
+      @JsonProperty("catalogUri") String catalogUri,
+      @JsonProperty("catalogProperties") @Nullable
+          Map<String, String> catalogProperties,
+      @JacksonInject @HiveConf Configuration configuration
+  )
+  {
+    this.warehousePath = Preconditions.checkNotNull(warehousePath, "warehousePath cannot be null");
+    this.catalogUri = Preconditions.checkNotNull(catalogUri, "catalogUri cannot be null");
+    this.catalogProperties = catalogProperties != null ? catalogProperties : new HashMap<>();
+    this.configuration = configuration;
+    this.catalogProperties
+        .forEach(this.configuration::set);
+    this.hiveCatalog = retrieveCatalog();
+  }
+
+  @Override
+  public BaseMetastoreCatalog retrieveCatalog()
+  {
+    if (hiveCatalog == null) {
+      hiveCatalog = setupCatalog();
+    }
+    return hiveCatalog;
+  }
+
+  private HiveCatalog setupCatalog()
+  {
+    HiveCatalog catalog = new HiveCatalog();
+    authenticate();
+    catalog.setConf(configuration);
+    catalogProperties.put("warehouse", warehousePath);
+    catalogProperties.put("uri", catalogUri);
+    catalog.initialize("hive", catalogProperties);
+    return catalog;
+  }
+
+  private void authenticate()
+  {
+    String principal = catalogProperties.getOrDefault("principal", null);

Review Comment:
   Added a line in the doc.





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "abhishekagarwal87 (via GitHub)" <gi...@apache.org>.
abhishekagarwal87 commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1243457526


##########
docs/development/extensions-contrib/iceberg.md:
##########
@@ -0,0 +1,121 @@
+---
+id: iceberg 
+title: "Iceberg"
+---
+
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+
+## Iceberg Ingest Extension
+
+This extension provides [IcebergInputSource](../../ingestion/input-sources.md#iceberg-input-source) which enables ingestion of data stored in the Iceberg table format into Druid.
+
+Apache Iceberg is an open table format for huge analytic datasets. Even though iceberg manages most of its metadata in metadata files in the object storage, it is still dependent on a metastore for managing a certain amount of metadata.
+These metastores are defined as Iceberg catalogs and this extension supports connecting to the following catalog types:
+* Hive metastore catalog
+* Local catalog
+
+Support for AWS Glue and REST-based catalogs is not available yet.
+
+For a given catalog, iceberg table name and filters, The IcebergInputSource works by reading the table from the catalog, applying the filters and extracting all the underlying live data files up to the latest snapshot.

Review Comment:
   ```suggestion
   For a given catalog, iceberg table name, and filters, the IcebergInputSource works by reading the table from the catalog, applying the filters, and extracting all the underlying live data files up to the latest snapshot.
   ```



##########
docs/development/extensions-contrib/iceberg.md:
##########
@@ -0,0 +1,121 @@
+---
+id: iceberg 
+title: "Iceberg"
+---
+
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+
+## Iceberg Ingest Extension
+
+This extension provides [IcebergInputSource](../../ingestion/input-sources.md#iceberg-input-source) which enables ingestion of data stored in the Iceberg table format into Druid.
+
+Apache Iceberg is an open table format for huge analytic datasets. Even though iceberg manages most of its metadata in metadata files in the object storage, it is still dependent on a metastore for managing a certain amount of metadata.
+These metastores are defined as Iceberg catalogs and this extension supports connecting to the following catalog types:
+* Hive metastore catalog
+* Local catalog
+
+Support for AWS Glue and REST-based catalogs is not available yet.
+
+For a given catalog, iceberg table name and filters, The IcebergInputSource works by reading the table from the catalog, applying the filters and extracting all the underlying live data files up to the latest snapshot.
+The data files are in either Parquet, ORC or Avro formats and all of these have InputFormat support in Druid. The data files typically reside in a warehouse location which could be in HDFS, S3 or the local filesystem.
+This extension relies on the existing InputSource connectors in Druid to read the data files from the warehouse. Therefore, the IcebergInputSource can be considered as an intermediate InputSource which provides the file paths for other InputSource implementations.
+
+### Load the Iceberg Ingest extension
+
+To use the iceberg extension, add the `druid-iceberg-extensions` to the list of loaded extensions. See [Loading extensions](../../configuration/extensions.md#loading-extensions) for more information.
+
+
+### Hive Metastore catalog
+
+For Druid to seamlessly talk to the Hive Metastore, ensure that the Hive specific configuration files such as `hive-site.xml` and `core-site.xml` are available in the Druid classpath for peon processes.  
+Hive specific properties can also be specified under the `catalogProperties` object in the ingestion spec. 
+
+Hive metastore catalogs can be associated with different types of warehouses, but this extension presently only supports HDFS and S3 warehouse directories.
+
+#### Reading from HDFS warehouse 
+
+Ensure that the extension `druid-hdfs-storage` is loaded. The data file paths are extracted from the Hive metastore catalog and [HDFS input source](../../ingestion/input-sources.md#hdfs-input-source) is used to ingest these files.
+The `warehouseSource` type in the ingestion spec should be `hdfs`.
+
+If the Hive metastore supports Kerberos authentication, the following properties will be required in the `catalogProperties`:
+
+```json
+"catalogProperties": {
+  "principal": "krb_principal",
+  "keytab": "/path/to/keytab"
+}
+```
+Only Kerberos based authentication is supported as of now.
+
+#### Reading from S3 warehouse
+
+Ensure that the extension `druid-s3-extensions` is loaded. The data file paths are extracted from the Hive metastore catalog and `S3InputSource` is used to ingest these files.
+The `warehouseSource` type in the ingestion spec should be `s3`. If the S3 endpoint for the warehouse is different from the endpoint configured as the deep storage, the following properties are required in the `warehouseSource` section to define the S3 endpoint settings:
+
+```json
+"warehouseSource": {
+  "type": "s3",
+  "endpointConfig": {
+    "url": "S3_ENDPOINT_URL",
+    "signingRegion": "us-east-1"
+  },
+  "clientConfig": {
+    "protocol": "http",
+    "disableChunkedEncoding": true,
+    "enablePathStyleAccess": true,
+    "forceGlobalBucketAccessEnabled": false
+  },
+  "properties": {
+    "accessKeyId": {
+      "type": "default",
+      "password": "<ACCESS_KEY_ID>"
+    },
+    "secretAccessKey": {
+      "type": "default",
+      "password": "<SECRET_ACCESS_KEY>"
+    }
+  }
+}
+```
+
+This extension uses the [Hadoop AWS module](https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/) to connect to S3 and retrieve the metadata and data file paths.
+The following properties will be required in the `catalogProperties`:
+
+```json
+"catalogProperties": {
+  "fs.s3a.access.key" : "S3_ACCESS_KEY",
+  "fs.s3a.secret.key" : "S3_SECRET_KEY",
+  "fs.s3a.endpoint" : "S3_API_ENDPOINT"
+}
+```
+Since the AWS connector uses the `s3a` filesystem based client, the warehouse path should be specified with the `s3a://` protocol instead of `s3://`.

Review Comment:
   ```suggestion
   Since the Hadoop AWS connector uses the `s3a` filesystem based client, the warehouse path should be specified with the `s3a://` protocol instead of `s3://`.
   ```
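For example, the warehouse path in the catalog definition would then take this shape (the bucket name below is a placeholder):

```json
"warehousePath": "s3a://my-warehouse-bucket/iceberg/warehouse"
```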



##########
docs/ingestion/input-sources.md:
##########
@@ -794,6 +794,194 @@ The following is an example of a Combining input source spec:
 ...
 ```
 
+## Iceberg input source
+
+> You need to include the `druid-iceberg-extensions` as an extension to use the Iceberg input source.
+
+The Iceberg input source is used to read data stored in the Iceberg table format. For a given table, this input source scans up to the latest iceberg snapshot from the configured Hive catalog, and the underlying live data files are ingested using the existing input source formats available in Druid.
+
+The Iceberg input source cannot be used independently, as it relies on the existing input sources to perform the actual reads from the data files.
+For example, if the warehouse associated with an iceberg catalog is on `S3`, please ensure that the [`druid-s3-extensions`](../development/extensions-core/s3.md) extension is also loaded.

Review Comment:
   does it require S3 extension though because behind the scenes, its using hadoop-aws module? 



##########
extensions-contrib/druid-iceberg-extensions/src/main/java/org/apache/druid/iceberg/common/IcebergDruidModule.java:
##########
@@ -0,0 +1,75 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.iceberg.common;
+
+import com.fasterxml.jackson.databind.Module;
+import com.fasterxml.jackson.databind.jsontype.NamedType;
+import com.fasterxml.jackson.databind.module.SimpleModule;
+import com.google.inject.Binder;
+import org.apache.druid.iceberg.guice.HiveConf;
+import org.apache.druid.iceberg.input.HiveIcebergCatalog;
+import org.apache.druid.iceberg.input.IcebergInputSource;
+import org.apache.druid.iceberg.input.LocalCatalog;
+import org.apache.druid.initialization.DruidModule;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileSystem;
+
+import java.util.Collections;
+import java.util.List;
+
+public class IcebergDruidModule implements DruidModule
+{
+  @Override
+  public List<? extends Module> getJacksonModules()
+  {
+    return Collections.singletonList(
+        new SimpleModule("IcebergDruidModule")
+            .registerSubtypes(
+                new NamedType(HiveIcebergCatalog.class, HiveIcebergCatalog.TYPE_KEY),
+                new NamedType(LocalCatalog.class, LocalCatalog.TYPE_KEY),
+                new NamedType(IcebergInputSource.class, IcebergInputSource.TYPE_KEY)
+
+            )
+    );
+  }
+
+  @Override
+  public void configure(Binder binder)
+  {
+    final Configuration conf = new Configuration();
+    conf.setClassLoader(getClass().getClassLoader());
+
+    ClassLoader currCtxCl = Thread.currentThread().getContextClassLoader();
+    try {
+      Thread.currentThread().setContextClassLoader(getClass().getClassLoader());
+      FileSystem.get(conf);
+    }
+    catch (Exception ex) {
+      throw new RuntimeException(ex);

Review Comment:
   can you use newly introduced `DruidException`? We are standardizing on that exception to encourage better user-facing error messages and in general more context about errors. 
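A rough sketch of what that could look like, assuming the builder-style `DruidException` API (`forPersona`/`ofCategory`/`build`); the persona and category constants chosen here are assumptions for illustration, not the merged code:

```java
catch (Exception ex) {
  // Assumed usage of the builder-style DruidException; constants may differ.
  throw DruidException.forPersona(DruidException.Persona.OPERATOR)
                      .ofCategory(DruidException.Category.RUNTIME_FAILURE)
                      .build(ex, "Failed to initialize the Hadoop FileSystem for the iceberg extension");
}
```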



##########
extensions-contrib/druid-iceberg-extensions/src/main/java/org/apache/druid/iceberg/filter/IcebergIntervalFilter.java:
##########
@@ -0,0 +1,90 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.iceberg.filter;
+
+import com.fasterxml.jackson.annotation.JsonCreator;
+import com.fasterxml.jackson.annotation.JsonProperty;
+import com.google.common.base.Preconditions;
+import org.apache.iceberg.TableScan;
+import org.apache.iceberg.expressions.Expression;
+import org.apache.iceberg.expressions.Expressions;
+import org.apache.iceberg.expressions.Literal;
+import org.apache.iceberg.types.Types;
+import org.joda.time.Interval;
+
+import java.util.ArrayList;
+import java.util.List;
+
+public class IcebergIntervalFilter implements IcebergFilter
+{
+  @JsonProperty
+  private final String filterColumn;
+
+  @JsonProperty
+  private final List<Interval> intervals;
+
+  @JsonCreator
+  public IcebergIntervalFilter(
+      @JsonProperty("filterColumn") String filterColumn,
+      @JsonProperty("intervals") List<Interval> intervals
+  )
+  {
+    Preconditions.checkNotNull(filterColumn, "filterColumn can not be null");
+    Preconditions.checkNotNull(intervals, "intervals can not be null");

Review Comment:
   we should throw the same error for intervals being empty. Also since its a user-facing error we should make the message a bit more clear. E.g. "You must specify intervals on the interval iceberg filter"
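A sketch of the suggested validation, covering both the null and the empty case with the clearer message:

```java
// Reject both null and empty interval lists up front with a user-facing message.
Preconditions.checkArgument(
    intervals != null && !intervals.isEmpty(),
    "You must specify intervals on the interval iceberg filter"
);
```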



##########
extensions-contrib/druid-iceberg-extensions/src/main/java/org/apache/druid/iceberg/filter/IcebergIntervalFilter.java:
##########
@@ -0,0 +1,90 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.iceberg.filter;
+
+import com.fasterxml.jackson.annotation.JsonCreator;
+import com.fasterxml.jackson.annotation.JsonProperty;
+import com.google.common.base.Preconditions;
+import org.apache.iceberg.TableScan;
+import org.apache.iceberg.expressions.Expression;
+import org.apache.iceberg.expressions.Expressions;
+import org.apache.iceberg.expressions.Literal;
+import org.apache.iceberg.types.Types;
+import org.joda.time.Interval;
+
+import java.util.ArrayList;
+import java.util.List;
+
+public class IcebergIntervalFilter implements IcebergFilter
+{
+  @JsonProperty
+  private final String filterColumn;
+
+  @JsonProperty
+  private final List<Interval> intervals;
+
+  @JsonCreator
+  public IcebergIntervalFilter(
+      @JsonProperty("filterColumn") String filterColumn,
+      @JsonProperty("intervals") List<Interval> intervals
+  )
+  {
+    Preconditions.checkNotNull(filterColumn, "filterColumn can not be null");
+    Preconditions.checkNotNull(intervals, "intervals can not be null");
+    this.filterColumn = filterColumn;
+    this.intervals = intervals;
+  }
+
+  @Override
+  public TableScan filter(TableScan tableScan)
+  {
+    return tableScan.filter(getFilterExpression());
+  }
+
+  @Override
+  public Expression getFilterExpression()
+  {
+    List<Expression> expressions = new ArrayList<>();
+    for (Interval filterInterval : intervals) {
+      Long dateStart = (long) Literal.of(filterInterval.getStart().toString())
+                                     .to(Types.TimestampType.withZone())
+                                     .value();
+      Long dateEnd = (long) Literal.of(filterInterval.getEnd().toString())
+                                   .to(Types.TimestampType.withZone())
+                                   .value();
+
+      expressions.add(Expressions.and(
+          Expressions.greaterThanOrEqual(
+              filterColumn,
+              dateStart
+          ),
+          Expressions.lessThan(
+              filterColumn,
+              dateEnd
+          )
+      ));
+    }
+    Expression finalExpr = Expressions.alwaysFalse();

Review Comment:
   is this extra expression required? 
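Presumably the lines below this hunk OR the per-interval expressions together, in which case `alwaysFalse()` serves as the identity seed for that fold. A sketch of the assumed pattern (not the merged code):

```java
// alwaysFalse() is the neutral element for OR, so the fold works even for a single interval.
Expression combined = Expressions.alwaysFalse();
for (Expression expr : expressions) {
  combined = Expressions.or(combined, expr);
}
return combined;
```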



##########
processing/src/main/java/org/apache/druid/data/input/AbstractInputSourceAdapter.java:
##########
@@ -0,0 +1,130 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.data.input;
+
+import com.fasterxml.jackson.annotation.JsonSubTypes;
+import com.fasterxml.jackson.annotation.JsonTypeInfo;
+import org.apache.druid.data.input.impl.LocalInputSourceAdapter;
+import org.apache.druid.data.input.impl.SplittableInputSource;
+import org.apache.druid.java.util.common.CloseableIterators;
+import org.apache.druid.java.util.common.ISE;
+import org.apache.druid.java.util.common.parsers.CloseableIterator;
+
+import javax.annotation.Nullable;
+import java.io.File;
+import java.io.IOException;
+import java.util.Collections;
+import java.util.List;
+import java.util.stream.Stream;
+
+/**
+ * A wrapper on top of {@link SplittableInputSource} that handles input source creation.

Review Comment:
   I think this class could be called LazyInputSourceBuilder or something to that effect. since it doesn't seem like an adapter. its primary responsibility is lazy on-demand instantiation of input sources. 



##########
extensions-contrib/druid-iceberg-extensions/pom.xml:
##########
@@ -0,0 +1,313 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+<project xmlns="http://maven.apache.org/POM/4.0.0"
+         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
+         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
+
+  <groupId>org.apache.druid.extensions</groupId>
+  <artifactId>druid-iceberg-extensions</artifactId>
+  <name>druid-iceberg-extensions</name>
+  <description>druid-iceberg-extensions</description>
+
+  <parent>
+    <artifactId>druid</artifactId>
+    <groupId>org.apache.druid</groupId>
+    <version>27.0.0-SNAPSHOT</version>
+    <relativePath>../../pom.xml</relativePath>
+  </parent>
+  <modelVersion>4.0.0</modelVersion>
+
+  <properties>
+  </properties>
+  <dependencies>
+    <dependency>
+      <groupId>org.apache.hadoop</groupId>
+      <artifactId>hadoop-common</artifactId>
+      <version>${hadoop.compile.version}</version>
+      <exclusions>
+        <exclusion>
+          <groupId>io.netty</groupId>
+          <artifactId>netty-buffer</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>commons-cli</groupId>
+          <artifactId>commons-cli</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>log4j</groupId>
+          <artifactId>log4j</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>commons-codec</groupId>
+          <artifactId>commons-codec</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>commons-logging</groupId>
+          <artifactId>commons-logging</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>commons-io</groupId>
+          <artifactId>commons-io</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>commons-lang</groupId>
+          <artifactId>commons-lang</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>org.apache.httpcomponents</groupId>
+          <artifactId>httpclient</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>org.apache.httpcomponents</groupId>
+          <artifactId>httpcore</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>org.apache.zookeeper</groupId>
+          <artifactId>zookeeper</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>org.slf4j</groupId>
+          <artifactId>slf4j-api</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>org.slf4j</groupId>
+          <artifactId>slf4j-log4j12</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>javax.ws.rs</groupId>
+          <artifactId>jsr311-api</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>com.google.code.findbugs</groupId>
+          <artifactId>jsr305</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>org.mortbay.jetty</groupId>
+          <artifactId>jetty-util</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>com.google.protobuf</groupId>
+          <artifactId>protobuf-java</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>com.sun.jersey</groupId>
+          <artifactId>jersey-core</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>org.apache.curator</groupId>
+          <artifactId>curator-client</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>org.apache.commons</groupId>
+          <artifactId>commons-math3</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>com.google.guava</groupId>
+          <artifactId>guava</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>org.apache.avro</groupId>
+          <artifactId>avro</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>net.java.dev.jets3t</groupId>
+          <artifactId>jets3t</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>com.sun.jersey</groupId>
+          <artifactId>jersey-json</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>com.jcraft</groupId>
+          <artifactId>jsch</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>org.mortbay.jetty</groupId>
+          <artifactId>jetty</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>com.sun.jersey</groupId>
+          <artifactId>jersey-server</artifactId>
+        </exclusion>
+        <!-- Following are excluded to remove security vulnerabilities: -->
+        <exclusion>
+          <groupId>commons-beanutils</groupId>
+          <artifactId>commons-beanutils-core</artifactId>
+        </exclusion>
+      </exclusions>
+    </dependency>
+
+    <dependency>
+      <groupId>org.apache.iceberg</groupId>
+      <artifactId>iceberg-spark-runtime-3.3_2.12</artifactId>
+      <version>1.0.0</version>
+      <exclusions>
+        <exclusion>
+          <groupId>com.fasterxml.jackson.core</groupId>
+          <artifactId>jackson-databind</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>com.fasterxml.jackson.core</groupId>
+          <artifactId>jackson-core</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>com.fasterxml.jackson.core</groupId>
+          <artifactId>jackson-annotations</artifactId>
+        </exclusion>
+      </exclusions>
+    </dependency>
+
+    <dependency>
+      <groupId>org.apache.hive</groupId>
+      <artifactId>hive-metastore</artifactId>
+      <version>3.1.3</version>
+      <exclusions>
+        <exclusion>
+          <groupId>com.fasterxml.jackson.core</groupId>
+          <artifactId>jackson-databind</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>com.fasterxml.jackson.core</groupId>
+          <artifactId>jackson-core</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>com.fasterxml.jackson.core</groupId>
+          <artifactId>jackson-annotations</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>org.apache.hadoop</groupId>
+          <artifactId>hadoop-hdfs</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>org.apache.hbase</groupId>
+          <artifactId>hbase-client</artifactId>
+        </exclusion>
+      </exclusions>
+    </dependency>
+
+    <dependency>
+      <groupId>org.apache.druid</groupId>
+      <artifactId>druid-processing</artifactId>
+      <version>${project.parent.version}</version>
+      <scope>provided</scope>
+      <exclusions>
+        <exclusion>
+          <groupId>org.slf4j</groupId>
+          <artifactId>slf4j-api</artifactId>
+        </exclusion>
+      </exclusions>
+    </dependency>
+    <dependency>
+      <groupId>com.fasterxml.jackson.core</groupId>
+      <artifactId>jackson-annotations</artifactId>
+      <scope>provided</scope>
+    </dependency>
+    <dependency>
+      <groupId>com.fasterxml.jackson.core</groupId>
+      <artifactId>jackson-databind</artifactId>
+      <scope>provided</scope>
+    </dependency>
+
+    <dependency>
+      <groupId>org.apache.hadoop</groupId>
+      <artifactId>hadoop-hdfs-client</artifactId>
+      <scope>runtime</scope>
+    </dependency>
+
+    <dependency>
+      <groupId>org.apache.hadoop</groupId>
+      <artifactId>hadoop-mapreduce-client-core</artifactId>

Review Comment:
   my understanding is that any hadoop client requires hadoop-client-api and hadoop-client-runtime which are like uber shaded jars. would that not work here? 



##########
processing/src/main/java/org/apache/druid/data/input/AbstractInputSourceAdapter.java:
##########
@@ -0,0 +1,130 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.data.input;
+
+import com.fasterxml.jackson.annotation.JsonSubTypes;
+import com.fasterxml.jackson.annotation.JsonTypeInfo;
+import org.apache.druid.data.input.impl.LocalInputSourceAdapter;
+import org.apache.druid.data.input.impl.SplittableInputSource;
+import org.apache.druid.java.util.common.CloseableIterators;
+import org.apache.druid.java.util.common.ISE;
+import org.apache.druid.java.util.common.parsers.CloseableIterator;
+
+import javax.annotation.Nullable;
+import java.io.File;
+import java.io.IOException;
+import java.util.Collections;
+import java.util.List;
+import java.util.stream.Stream;
+
+/**
+ * A wrapper on top of {@link SplittableInputSource} that handles input source creation.

Review Comment:
   thinking a bit more about it, memoization doesn't require a class of its own at all. That's something IcebergInputSource can do itself. So all we require is the ability to generate an input source dynamically. and a single-method interface is good enough to achieve that. We can call it `FileInputSourceBuilder` or `FileInputSourceGenerator`. 
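Something along these lines, as a hypothetical shape for that single-method interface (the name, parameter, and return types are illustrative, not part of this PR):

```java
// Hypothetical single-method interface for lazy, on-demand input source creation.
public interface FileInputSourceGenerator
{
  /**
   * Build the concrete splittable input source for the given data file paths.
   */
  SplittableInputSource<?> generate(List<String> dataFilePaths);
}
```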



##########
extensions-contrib/druid-iceberg-extensions/src/main/java/org/apache/druid/iceberg/filter/IcebergAndFilter.java:
##########
@@ -0,0 +1,95 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.iceberg.filter;
+
+import com.fasterxml.jackson.annotation.JsonCreator;
+import com.fasterxml.jackson.annotation.JsonProperty;
+import com.google.common.base.Preconditions;
+import org.apache.druid.java.util.common.logger.Logger;
+import org.apache.iceberg.TableScan;
+import org.apache.iceberg.expressions.Expression;
+import org.apache.iceberg.expressions.Expressions;
+
+import java.util.ArrayList;
+import java.util.Collection;
+import java.util.LinkedHashSet;
+import java.util.List;
+
+public class IcebergAndFilter implements IcebergFilter
+{
+
+  private final List<IcebergFilter> filters;
+
+  private static final Logger log = new Logger(IcebergAndFilter.class);
+
+  @JsonCreator
+  public IcebergAndFilter(
+      @JsonProperty("filters") List<IcebergFilter> filters
+  )
+  {
+    Preconditions.checkArgument(filters != null && filters.size() > 0, "filter requires at least one field");
+    this.filters = filters;
+  }
+
+  @JsonProperty
+  public List<IcebergFilter> getFilters()
+  {
+    return filters;
+  }
+
+  @Override
+  public TableScan filter(TableScan tableScan)
+  {
+    return tableScan.filter(getFilterExpression());
+  }
+
+  @Override
+  public Expression getFilterExpression()
+  {
+    List<Expression> expressions = new ArrayList<>();
+    LinkedHashSet<IcebergFilter> flatFilters = flattenAndChildren(filters);
+    if (!flatFilters.isEmpty()) {
+      for (IcebergFilter filter : flatFilters) {
+        expressions.add(filter.getFilterExpression());
+      }
+    } else {
+      log.error("Empty filter set, running iceberg table scan without filters");

Review Comment:
   that's not an error. In fact, we shouldn't ever get here at all so might as well remove this else block entirely. 
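In other words, with the unreachable branch dropped, the body could reduce to a sketch like this:

```java
// Collect the flattened child expressions; no special handling for an empty set is needed here.
List<Expression> expressions = new ArrayList<>();
for (IcebergFilter filter : flattenAndChildren(filters)) {
  expressions.add(filter.getFilterExpression());
}
```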



##########
extensions-contrib/druid-iceberg-extensions/src/main/java/org/apache/druid/iceberg/filter/IcebergIntervalFilter.java:
##########
@@ -0,0 +1,90 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.iceberg.filter;
+
+import com.fasterxml.jackson.annotation.JsonCreator;
+import com.fasterxml.jackson.annotation.JsonProperty;
+import com.google.common.base.Preconditions;
+import org.apache.iceberg.TableScan;
+import org.apache.iceberg.expressions.Expression;
+import org.apache.iceberg.expressions.Expressions;
+import org.apache.iceberg.expressions.Literal;
+import org.apache.iceberg.types.Types;
+import org.joda.time.Interval;
+
+import java.util.ArrayList;
+import java.util.List;
+
+public class IcebergIntervalFilter implements IcebergFilter
+{
+  @JsonProperty
+  private final String filterColumn;
+
+  @JsonProperty
+  private final List<Interval> intervals;
+
+  @JsonCreator
+  public IcebergIntervalFilter(
+      @JsonProperty("filterColumn") String filterColumn,
+      @JsonProperty("intervals") List<Interval> intervals
+  )
+  {
+    Preconditions.checkNotNull(filterColumn, "filterColumn can not be null");
+    Preconditions.checkNotNull(intervals, "intervals can not be null");
+    this.filterColumn = filterColumn;
+    this.intervals = intervals;
+  }
+
+  @Override
+  public TableScan filter(TableScan tableScan)
+  {
+    return tableScan.filter(getFilterExpression());
+  }
+
+  @Override
+  public Expression getFilterExpression()
+  {
+    List<Expression> expressions = new ArrayList<>();
+    for (Interval filterInterval : intervals) {
+      Long dateStart = (long) Literal.of(filterInterval.getStart().toString())
+                                     .to(Types.TimestampType.withZone())
+                                     .value();
+      Long dateEnd = (long) Literal.of(filterInterval.getEnd().toString())
+                                   .to(Types.TimestampType.withZone())
+                                   .value();
+
+      expressions.add(Expressions.and(

Review Comment:
   we should make it clear in the docs that start is inclusive but end is not. 



##########
docs/development/extensions-contrib/iceberg.md:
##########
@@ -0,0 +1,121 @@
+---
+id: iceberg 
+title: "Iceberg"
+---
+
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+
+## Iceberg Ingest Extension
+
+This extension provides [IcebergInputSource](../../ingestion/input-sources.md#iceberg-input-source) which enables ingestion of data stored in the Iceberg table format into Druid.
+
+Apache Iceberg is an open table format for huge analytic datasets. Even though iceberg manages most of its metadata in metadata files in the object storage, it is still dependent on a metastore for managing a certain amount of metadata.
+These metastores are defined as Iceberg catalogs and this extension supports connecting to the following catalog types:
+* Hive metastore catalog
+* Local catalog
+
+Support for AWS Glue and REST-based catalogs is not available yet.
+
+For a given catalog, iceberg table name and filters, The IcebergInputSource works by reading the table from the catalog, applying the filters and extracting all the underlying live data files up to the latest snapshot.
+The data files are in either Parquet, ORC or Avro formats and all of these have InputFormat support in Druid. The data files typically reside in a warehouse location which could be in HDFS, S3 or the local filesystem.
+This extension relies on the existing InputSource connectors in Druid to read the data files from the warehouse. Therefore, the IcebergInputSource can be considered as an intermediate InputSource which provides the file paths for other InputSource implementations.
+
+### Load the Iceberg Ingest extension
+
+To use the iceberg extension, add the `druid-iceberg-extensions` to the list of loaded extensions. See [Loading extensions](../../configuration/extensions.md#loading-extensions) for more information.
+
+
+### Hive Metastore catalog
+
+For Druid to seamlessly talk to the Hive Metastore, ensure that the Hive specific configuration files such as `hive-site.xml` and `core-site.xml` are available in the Druid classpath for peon processes.  
+Hive specific properties can also be specified under the `catalogProperties` object in the ingestion spec. 
+
+Hive metastore catalogs can be associated with different types of warehouses, but this extension presently only supports HDFS and S3 warehouse directories.
+
+#### Reading from HDFS warehouse 
+
+Ensure that the extension `druid-hdfs-storage` is loaded. The data file paths are extracted from the Hive metastore catalog and [HDFS input source](../../ingestion/input-sources.md#hdfs-input-source) is used to ingest these files.
+The `warehouseSource` type in the ingestion spec should be `hdfs`.
+
+If the Hive metastore supports Kerberos authentication, the following properties will be required in the `catalogProperties`:
+
+```json
+"catalogProperties": {
+  "principal": "krb_principal",
+  "keytab": "/path/to/keytab"
+}
+```
+Only Kerberos based authentication is supported as of now.
+
+#### Reading from S3 warehouse
+
+Ensure that the extension `druid-s3-extensions` is loaded. The data file paths are extracted from the Hive metastore catalog and `S3InputSource` is used to ingest these files.
+The `warehouseSource` type in the ingestion spec should be `s3`. If the S3 endpoint for the warehouse is different from the endpoint configured as the deep storage, the following properties are required in the `warehouseSource` section to define the S3 endpoint settings:
+
+```json
+"warehouseSource": {
+  "type": "s3",
+  "endpointConfig": {
+    "url": "S3_ENDPOINT_URL",
+    "signingRegion": "us-east-1"
+  },
+  "clientConfig": {
+    "protocol": "http",
+    "disableChunkedEncoding": true,
+    "enablePathStyleAccess": true,
+    "forceGlobalBucketAccessEnabled": false
+  },
+  "properties": {
+    "accessKeyId": {
+      "type": "default",
+      "password": "<ACCESS_KEY_ID>"
+    },
+    "secretAccessKey": {
+      "type": "default",
+      "password": "<SECRET_ACCESS_KEY>"
+    }
+  }
+}
+```
+
+This extension uses the [Hadoop AWS module](https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/) to connect to S3 and retrieve the metadata and data file paths.
+The following properties will be required in the `catalogProperties`:
+
+```json
+"catalogProperties": {
+  "fs.s3a.access.key" : "S3_ACCESS_KEY",

Review Comment:
   can you confirm that these get masked when we log these properties or when someone looks at the ingestion spec? 



##########
extensions-contrib/druid-iceberg-extensions/src/main/java/org/apache/druid/iceberg/filter/IcebergNotFilter.java:
##########
@@ -0,0 +1,60 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.iceberg.filter;
+
+import com.fasterxml.jackson.annotation.JsonCreator;
+import com.fasterxml.jackson.annotation.JsonProperty;
+import com.google.common.base.Preconditions;
+import org.apache.iceberg.TableScan;
+import org.apache.iceberg.expressions.Expression;
+import org.apache.iceberg.expressions.Expressions;
+
+
+public class IcebergNotFilter implements IcebergFilter
+{
+  private final IcebergFilter filter;
+
+  @JsonCreator
+  public IcebergNotFilter(
+      @JsonProperty("filter") IcebergFilter filter
+  )
+  {
+    Preconditions.checkNotNull(filter, "filter cannot be null");

Review Comment:
   same comment about the error message. 



##########
extensions-contrib/druid-iceberg-extensions/src/main/java/org/apache/druid/iceberg/input/IcebergCatalog.java:
##########
@@ -0,0 +1,104 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.iceberg.input;
+
+import com.fasterxml.jackson.annotation.JsonTypeInfo;
+import org.apache.druid.data.input.InputFormat;
+import org.apache.druid.iceberg.filter.IcebergFilter;
+import org.apache.druid.java.util.common.IAE;
+import org.apache.druid.java.util.common.RE;
+import org.apache.druid.java.util.common.logger.Logger;
+import org.apache.iceberg.BaseMetastoreCatalog;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.TableScan;
+import org.apache.iceberg.catalog.Catalog;
+import org.apache.iceberg.catalog.Namespace;
+import org.apache.iceberg.catalog.TableIdentifier;
+import org.apache.iceberg.io.CloseableIterable;
+
+import java.util.ArrayList;
+import java.util.List;
+
+/*
+ * Druid wrapper for an iceberg catalog.
+ * The configured catalog is used to load the specified iceberg table and retrieve the underlying live data files up to the latest snapshot.
+ * This does not perform any projections on the table yet; therefore, all the underlying columns will be retrieved from the data files.
+ */
+
+@JsonTypeInfo(use = JsonTypeInfo.Id.NAME, property = InputFormat.TYPE_PROPERTY)
+public abstract class IcebergCatalog
+{
+  private static final Logger log = new Logger(IcebergCatalog.class);
+
+  public abstract BaseMetastoreCatalog retrieveCatalog();
+
+  /**
+   * Extract the iceberg data files up to the latest snapshot associated with the table
+   *
+   * @param tableNamespace The catalog namespace under which the table is defined
+   * @param tableName      The iceberg table name
+   * @return a list of data file paths
+   */
+  public List<String> extractSnapshotDataFiles(
+      String tableNamespace,
+      String tableName,
+      IcebergFilter icebergFilter
+  )
+  {
+    Catalog catalog = retrieveCatalog();
+    Namespace namespace = Namespace.of(tableNamespace);
+    String tableIdentifier = tableNamespace + "." + tableName;
+
+    List<String> dataFilePaths = new ArrayList<>();
+
+    ClassLoader currCtxClassloader = Thread.currentThread().getContextClassLoader();
+    try {
+      Thread.currentThread().setContextClassLoader(getClass().getClassLoader());
+      TableIdentifier icebergTableIdentifier = catalog.listTables(namespace).stream()
+                                                      .filter(tableId -> tableId.toString().equals(tableIdentifier))
+                                                      .findFirst()
+                                                      .orElseThrow(() -> new IAE(
+                                                          " Couldn't retrieve table identifier for '%s'",
+                                                          tableIdentifier
+                                                      ));
+
+      long start = System.currentTimeMillis();
+      TableScan tableScan = catalog.loadTable(icebergTableIdentifier).newScan();
+
+      if (icebergFilter != null) {
+        tableScan = icebergFilter.filter(tableScan);
+      }
+
+      CloseableIterable<FileScanTask> tasks = tableScan.planFiles();
+      CloseableIterable.transform(tasks, FileScanTask::file)
+                       .forEach(dataFile -> dataFilePaths.add(dataFile.path().toString()));
+
+      long duration = System.currentTimeMillis() - start;
+      log.info("Data file scan and fetch took %d ms for %d paths", duration, dataFilePaths.size());

Review Comment:
   ```suggestion
         log.info("Data file scan and fetch took [%d ms] time for [%d] paths", duration, dataFilePaths.size());
   ```



##########
extensions-contrib/druid-iceberg-extensions/src/main/java/org/apache/druid/iceberg/filter/IcebergIntervalFilter.java:
##########
@@ -0,0 +1,90 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.iceberg.filter;
+
+import com.fasterxml.jackson.annotation.JsonCreator;
+import com.fasterxml.jackson.annotation.JsonProperty;
+import com.google.common.base.Preconditions;
+import org.apache.iceberg.TableScan;
+import org.apache.iceberg.expressions.Expression;
+import org.apache.iceberg.expressions.Expressions;
+import org.apache.iceberg.expressions.Literal;
+import org.apache.iceberg.types.Types;
+import org.joda.time.Interval;
+
+import java.util.ArrayList;
+import java.util.List;
+
+public class IcebergIntervalFilter implements IcebergFilter
+{
+  @JsonProperty
+  private final String filterColumn;
+
+  @JsonProperty
+  private final List<Interval> intervals;
+
+  @JsonCreator
+  public IcebergIntervalFilter(
+      @JsonProperty("filterColumn") String filterColumn,
+      @JsonProperty("intervals") List<Interval> intervals
+  )
+  {
+    Preconditions.checkNotNull(filterColumn, "filterColumn can not be null");
+    Preconditions.checkNotNull(intervals, "intervals can not be null");
+    this.filterColumn = filterColumn;
+    this.intervals = intervals;
+  }
+
+  @Override
+  public TableScan filter(TableScan tableScan)
+  {
+    return tableScan.filter(getFilterExpression());
+  }
+
+  @Override
+  public Expression getFilterExpression()
+  {
+    List<Expression> expressions = new ArrayList<>();
+    for (Interval filterInterval : intervals) {
+      Long dateStart = (long) Literal.of(filterInterval.getStart().toString())
+                                     .to(Types.TimestampType.withZone())

Review Comment:
   what does this call do? 



##########
extensions-contrib/druid-iceberg-extensions/src/main/java/org/apache/druid/iceberg/input/IcebergCatalog.java:
##########
@@ -0,0 +1,104 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.iceberg.input;
+
+import com.fasterxml.jackson.annotation.JsonTypeInfo;
+import org.apache.druid.data.input.InputFormat;
+import org.apache.druid.iceberg.filter.IcebergFilter;
+import org.apache.druid.java.util.common.IAE;
+import org.apache.druid.java.util.common.RE;
+import org.apache.druid.java.util.common.logger.Logger;
+import org.apache.iceberg.BaseMetastoreCatalog;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.TableScan;
+import org.apache.iceberg.catalog.Catalog;
+import org.apache.iceberg.catalog.Namespace;
+import org.apache.iceberg.catalog.TableIdentifier;
+import org.apache.iceberg.io.CloseableIterable;
+
+import java.util.ArrayList;
+import java.util.List;
+
+/*
+ * Druid wrapper for an iceberg catalog.
+ * The configured catalog is used to load the specified iceberg table and retrieve the underlying live data files up to the latest snapshot.
+ * This does not perform any projections on the table yet; therefore, all the underlying columns will be retrieved from the data files.
+ */
+
+@JsonTypeInfo(use = JsonTypeInfo.Id.NAME, property = InputFormat.TYPE_PROPERTY)
+public abstract class IcebergCatalog
+{
+  private static final Logger log = new Logger(IcebergCatalog.class);
+
+  public abstract BaseMetastoreCatalog retrieveCatalog();
+
+  /**
+   * Extract the iceberg data files up to the latest snapshot associated with the table
+   *
+   * @param tableNamespace The catalog namespace under which the table is defined
+   * @param tableName      The iceberg table name
+   * @return a list of data file paths
+   */
+  public List<String> extractSnapshotDataFiles(
+      String tableNamespace,
+      String tableName,
+      IcebergFilter icebergFilter
+  )
+  {
+    Catalog catalog = retrieveCatalog();
+    Namespace namespace = Namespace.of(tableNamespace);
+    String tableIdentifier = tableNamespace + "." + tableName;
+
+    List<String> dataFilePaths = new ArrayList<>();
+
+    ClassLoader currCtxClassloader = Thread.currentThread().getContextClassLoader();
+    try {
+      Thread.currentThread().setContextClassLoader(getClass().getClassLoader());
+      TableIdentifier icebergTableIdentifier = catalog.listTables(namespace).stream()
+                                                      .filter(tableId -> tableId.toString().equals(tableIdentifier))
+                                                      .findFirst()
+                                                      .orElseThrow(() -> new IAE(
+                                                          " Couldn't retrieve table identifier for '%s'",

Review Comment:
   is there an action that the user can take to remedy this error? Like verifying that the table indeed exists in the catalog. 



##########
extensions-contrib/druid-iceberg-extensions/src/main/java/org/apache/druid/iceberg/filter/IcebergOrFilter.java:
##########
@@ -0,0 +1,96 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.iceberg.filter;
+
+import com.fasterxml.jackson.annotation.JsonCreator;
+import com.fasterxml.jackson.annotation.JsonProperty;
+import com.google.common.base.Preconditions;
+import org.apache.druid.java.util.common.logger.Logger;
+import org.apache.iceberg.TableScan;
+import org.apache.iceberg.expressions.Expression;
+import org.apache.iceberg.expressions.Expressions;
+
+import java.util.ArrayList;
+import java.util.Collection;
+import java.util.LinkedHashSet;
+import java.util.List;
+
+public class IcebergOrFilter implements IcebergFilter
+{
+  private final List<IcebergFilter> filters;
+
+  private static final Logger log = new Logger(IcebergAndFilter.class);
+
+  @JsonCreator
+  public IcebergOrFilter(
+      @JsonProperty("filters") List<IcebergFilter> filters
+  )
+  {
+    Preconditions.checkArgument(filters != null && filters.size() > 0, "filter requires at least one field");
+    this.filters = filters;
+  }
+
+  @JsonProperty
+  public List<IcebergFilter> getFilters()
+  {
+    return filters;
+  }
+
+  @Override
+  public TableScan filter(TableScan tableScan)
+  {
+    return tableScan.filter(getFilterExpression());
+  }
+
+  @Override
+  public Expression getFilterExpression()
+  {
+    List<Expression> expressions = new ArrayList<>();
+    LinkedHashSet<IcebergFilter> flatFilters = flattenOrChildren(filters);
+    if (!flatFilters.isEmpty()) {
+      for (IcebergFilter filter : flatFilters) {
+        expressions.add(filter.getFilterExpression());
+      }
+    } else {
+      log.error("Empty filter set, running iceberg table scan without filters");

Review Comment:
   same comment about the error message. 
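
   For reference, the constructor above takes a list of child filters, so an `or` filter in the spec would look roughly like this sketch (assuming the subtype is registered under the name `or`; the `equals` child with its `filterColumn`/`filterValue` fields is an assumption, and the column names are placeholders):

```json
"icebergFilter": {
  "type": "or",
  "filters": [
    {
      "type": "interval",
      "filterColumn": "event_time",
      "intervals": [
        "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
      ]
    },
    {
      "type": "equals",
      "filterColumn": "region",
      "filterValue": "us-east-1"
    }
  ]
}
```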



##########
processing/src/main/java/org/apache/druid/data/input/AbstractInputSourceAdapter.java:
##########
@@ -0,0 +1,133 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.data.input;
+
+import com.fasterxml.jackson.annotation.JsonSubTypes;
+import com.fasterxml.jackson.annotation.JsonTypeInfo;
+import org.apache.druid.data.input.impl.LocalInputSourceAdapter;
+import org.apache.druid.data.input.impl.SplittableInputSource;
+import org.apache.druid.java.util.common.CloseableIterators;
+import org.apache.druid.java.util.common.ISE;
+import org.apache.druid.java.util.common.parsers.CloseableIterator;
+
+import javax.annotation.Nullable;
+import java.io.File;
+import java.io.IOException;
+import java.util.Collections;
+import java.util.List;
+import java.util.stream.Stream;
+
+/**
+ * A wrapper on top of {@link SplittableInputSource} that handles input source creation.
+ * For composing input sources such as IcebergInputSource, the delegate input source instantiation might fail upon deserialization since the input file paths
+ * are not available yet and this might fail the input source precondition checks.
+ * This adapter helps create the delegate input source once the input file paths are fully determined.
+ */
+@JsonTypeInfo(use = JsonTypeInfo.Id.NAME, property = "type")
+@JsonSubTypes(value = {
+    @JsonSubTypes.Type(name = LocalInputSourceAdapter.TYPE_KEY, value = LocalInputSourceAdapter.class)
+})
+public abstract class AbstractInputSourceAdapter
+{
+  private SplittableInputSource inputSource;
+
+  public abstract SplittableInputSource generateInputSource(List<String> inputFilePaths);
+
+  public void setupInputSource(List<String> inputFilePaths)
+  {
+    if (inputSource != null) {
+      throw new ISE("Inputsource is already initialized!");
+    }
+    if (inputFilePaths.isEmpty()) {
+      inputSource = new EmptyInputSource();
+    } else {
+      inputSource = generateInputSource(inputFilePaths);
+    }
+  }
+
+  public SplittableInputSource getInputSource()
+  {
+    if (inputSource == null) {
+      throw new ISE("Inputsource is not initialized yet!");
+    }
+    return inputSource;
+  }
+
+  private static class EmptyInputSource implements SplittableInputSource

Review Comment:
   You should add a note here that this class exists because some underlying input sources might not accept an empty list of input paths, while an empty list is possible when working with iceberg.



##########
extensions-contrib/druid-iceberg-extensions/src/main/java/org/apache/druid/iceberg/input/IcebergCatalog.java:
##########
@@ -0,0 +1,104 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.iceberg.input;
+
+import com.fasterxml.jackson.annotation.JsonTypeInfo;
+import org.apache.druid.data.input.InputFormat;
+import org.apache.druid.iceberg.filter.IcebergFilter;
+import org.apache.druid.java.util.common.IAE;
+import org.apache.druid.java.util.common.RE;
+import org.apache.druid.java.util.common.logger.Logger;
+import org.apache.iceberg.BaseMetastoreCatalog;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.TableScan;
+import org.apache.iceberg.catalog.Catalog;
+import org.apache.iceberg.catalog.Namespace;
+import org.apache.iceberg.catalog.TableIdentifier;
+import org.apache.iceberg.io.CloseableIterable;
+
+import java.util.ArrayList;
+import java.util.List;
+
+/*
+ * Druid wrapper for an iceberg catalog.
+ * The configured catalog is used to load the specified iceberg table and retrieve the underlying live data files up to the latest snapshot.
+ * This does not perform any projections on the table yet; therefore, all the underlying columns will be retrieved from the data files.
+ */
+
+@JsonTypeInfo(use = JsonTypeInfo.Id.NAME, property = InputFormat.TYPE_PROPERTY)
+public abstract class IcebergCatalog
+{
+  private static final Logger log = new Logger(IcebergCatalog.class);
+
+  public abstract BaseMetastoreCatalog retrieveCatalog();
+
+  /**
+   * Extract the iceberg data files up to the latest snapshot associated with the table

Review Comment:
   what does extracting mean exactly? It returns a list of remote file paths? 



##########
extensions-contrib/druid-iceberg-extensions/pom.xml:
##########
@@ -0,0 +1,314 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+<project xmlns="http://maven.apache.org/POM/4.0.0"
+         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
+         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
+
+  <groupId>org.apache.druid.extensions</groupId>
+  <artifactId>druid-iceberg-extensions</artifactId>
+  <name>druid-iceberg-extensions</name>
+  <description>druid-iceberg-extensions</description>
+
+
+  <parent>
+    <artifactId>druid</artifactId>
+    <groupId>org.apache.druid</groupId>
+    <version>27.0.0-SNAPSHOT</version>
+    <relativePath>../../pom.xml</relativePath>
+  </parent>
+  <modelVersion>4.0.0</modelVersion>
+
+  <properties>
+  </properties>
+  <dependencies>
+    <dependency>

Review Comment:
   Not required. Though, can we avoid adding this module altogether if the hadoop2 profile is activated? Assuming such a thing is possible.



##########
processing/src/main/java/org/apache/druid/data/input/AbstractInputSourceAdapter.java:
##########
@@ -0,0 +1,130 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.data.input;
+
+import com.fasterxml.jackson.annotation.JsonSubTypes;
+import com.fasterxml.jackson.annotation.JsonTypeInfo;
+import org.apache.druid.data.input.impl.LocalInputSourceAdapter;
+import org.apache.druid.data.input.impl.SplittableInputSource;
+import org.apache.druid.java.util.common.CloseableIterators;
+import org.apache.druid.java.util.common.ISE;
+import org.apache.druid.java.util.common.parsers.CloseableIterator;
+
+import javax.annotation.Nullable;
+import java.io.File;
+import java.io.IOException;
+import java.util.Collections;
+import java.util.List;
+import java.util.stream.Stream;
+
+/**
+ * A wrapper on top of {@link SplittableInputSource} that handles input source creation.

Review Comment:
   In fact, this class could be split into one concrete class that does memoization and one interface that has a `build` method. The extensions can just implement the interface. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "a2l007 (via GitHub)" <gi...@apache.org>.
a2l007 commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1255048950


##########
docs/development/extensions-contrib/iceberg.md:
##########
@@ -0,0 +1,121 @@
+---
+id: iceberg 
+title: "Iceberg"
+---
+
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+
+## Iceberg Ingest Extension
+
+This extension provides [IcebergInputSource](../../ingestion/input-sources.md#iceberg-input-source) which enables ingestion of data stored in the Iceberg table format into Druid.
+
+Apache Iceberg is an open table format for huge analytic datasets. Even though iceberg manages most of its metadata in metadata files in object storage, it still depends on a metastore for managing a certain amount of metadata.
+These metastores are defined as Iceberg catalogs and this extension supports connecting to the following catalog types:
+* Hive metastore catalog
+* Local catalog
+
+Support for AWS Glue and REST-based catalogs is not available yet.
+
+For a given catalog, iceberg table name, and filters, the IcebergInputSource works by reading the table from the catalog, applying the filters, and extracting all the underlying live data files up to the latest snapshot.
+The data files are in either Parquet, ORC, or Avro format, all of which have InputFormat support in Druid. The data files typically reside in a warehouse location, which could be in HDFS, S3, or the local filesystem.
+This extension relies on the existing InputSource connectors in Druid to read the data files from the warehouse. Therefore, the IcebergInputSource can be considered an intermediate InputSource that provides the file paths for other InputSource implementations.
+
+### Load the Iceberg Ingest extension
+
+To use the iceberg extension, add the `druid-iceberg-extensions` to the list of loaded extensions. See [Loading extensions](../../configuration/extensions.md#loading-extensions) for more information.
+
+
+### Hive Metastore catalog
+
+For Druid to seamlessly talk to the Hive Metastore, ensure that the Hive-specific configuration files such as `hive-site.xml` and `core-site.xml` are available in the Druid classpath for peon processes.
+Hive-specific properties can also be specified under the `catalogProperties` object in the ingestion spec.
+
+Hive metastore catalogs can be associated with different types of warehouses, but this extension presently only supports HDFS and S3 warehouse directories.
+
+#### Reading from HDFS warehouse 
+
+Ensure that the extension `druid-hdfs-storage` is loaded. The data file paths are extracted from the Hive metastore catalog and [HDFS input source](../../ingestion/input-sources.md#hdfs-input-source) is used to ingest these files.
+The `warehouseSource` type in the ingestion spec should be `hdfs`.
+
+If the Hive metastore supports Kerberos authentication, the following properties will be required in the `catalogProperties`:
+
+```json
+"catalogProperties": {
+  "principal": "krb_principal",
+  "keytab": "/path/to/keytab"
+}
+```
+Only Kerberos-based authentication is supported at this time.
+
+#### Reading from S3 warehouse
+
+Ensure that the extension `druid-s3-extensions` is loaded. The data file paths are extracted from the Hive metastore catalog and `S3InputSource` is used to ingest these files.
+The `warehouseSource` type in the ingestion spec should be `s3`. If the S3 endpoint for the warehouse is different from the endpoint configured as the deep storage, the following properties are required in the `warehouseSource` section to define the S3 endpoint settings:
+
+```json
+"warehouseSource": {
+  "type": "s3",
+  "endpointConfig": {
+    "url": "S3_ENDPOINT_URL",
+    "signingRegion": "us-east-1"
+  },
+  "clientConfig": {
+    "protocol": "http",
+    "disableChunkedEncoding": true,
+    "enablePathStyleAccess": true,
+    "forceGlobalBucketAccessEnabled": false
+  },
+  "properties": {
+    "accessKeyId": {
+      "type": "default",
+      "password": "<ACCESS_KEY_ID"
+    },
+    "secretAccessKey": {
+      "type": "default",
+      "password": "<SECRET_ACCESS_KEY>"
+    }
+  }
+}
+```
+
+This extension uses the [Hadoop AWS module](https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/) to connect to S3 and retrieve the metadata and data file paths.
+The following properties will be required in the `catalogProperties`:
+
+```json
+"catalogProperties": {
+  "fs.s3a.access.key" : "S3_ACCESS_KEY",

Review Comment:
   It wasn't masked earlier; I've added support for the DynamicConfigProvider now, so it should be good.
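
   A hypothetical sketch of what that could look like, assuming `catalogProperties` honors the same `druid.dynamic.config.provider` convention used elsewhere in Druid (for example, in Kafka ingestion's `consumerProperties`); the exact key and wiring in this extension may differ:

```json
"catalogProperties": {
  "fs.s3a.endpoint": "S3_API_ENDPOINT",
  "druid.dynamic.config.provider": {
    "type": "environment",
    "variables": {
      "fs.s3a.access.key": "ACCESS_KEY_ENV_VAR",
      "fs.s3a.secret.key": "SECRET_KEY_ENV_VAR"
    }
  }
}
```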




Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "a2l007 (via GitHub)" <gi...@apache.org>.
a2l007 commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1255048067


##########
extensions-contrib/druid-iceberg-extensions/src/main/java/org/apache/druid/iceberg/filter/IcebergIntervalFilter.java:
##########
@@ -0,0 +1,90 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.iceberg.filter;
+
+import com.fasterxml.jackson.annotation.JsonCreator;
+import com.fasterxml.jackson.annotation.JsonProperty;
+import com.google.common.base.Preconditions;
+import org.apache.iceberg.TableScan;
+import org.apache.iceberg.expressions.Expression;
+import org.apache.iceberg.expressions.Expressions;
+import org.apache.iceberg.expressions.Literal;
+import org.apache.iceberg.types.Types;
+import org.joda.time.Interval;
+
+import java.util.ArrayList;
+import java.util.List;
+
+public class IcebergIntervalFilter implements IcebergFilter
+{
+  @JsonProperty
+  private final String filterColumn;
+
+  @JsonProperty
+  private final List<Interval> intervals;
+
+  @JsonCreator
+  public IcebergIntervalFilter(
+      @JsonProperty("filterColumn") String filterColumn,
+      @JsonProperty("intervals") List<Interval> intervals
+  )
+  {
+    Preconditions.checkNotNull(filterColumn, "filterColumn can not be null");
+    Preconditions.checkNotNull(intervals, "intervals can not be null");
+    this.filterColumn = filterColumn;
+    this.intervals = intervals;
+  }
+
+  @Override
+  public TableScan filter(TableScan tableScan)
+  {
+    return tableScan.filter(getFilterExpression());
+  }
+
+  @Override
+  public Expression getFilterExpression()
+  {
+    List<Expression> expressions = new ArrayList<>();
+    for (Interval filterInterval : intervals) {
+      Long dateStart = (long) Literal.of(filterInterval.getStart().toString())
+                                     .to(Types.TimestampType.withZone())

Review Comment:
   The Iceberg timestamp type supports microsecond precision, so this converts the input timestamp string into the Iceberg TimestampType. This ensures there are no precision mismatches when doing the comparison.
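
   In other words, the interval endpoints can be written as ordinary ISO-8601 strings in the spec and are converted to Iceberg's microsecond-precision timestamp representation before comparison; a minimal sketch (the column name is a placeholder):

```json
"icebergFilter": {
  "type": "interval",
  "filterColumn": "event_time",
  "intervals": [
    "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
  ]
}
```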




Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260473174


##########
docs/development/extensions-contrib/iceberg.md:
##########
@@ -0,0 +1,121 @@
+---
+id: iceberg 
+title: "Iceberg"
+---
+
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+
+## Iceberg Ingest Extension
+
+This extension provides [IcebergInputSource](../../ingestion/input-sources.md#iceberg-input-source) which enables ingestion of data stored in the Iceberg table format into Druid.
+
+Apache Iceberg is an open table format for huge analytic datasets. Even though iceberg manages most of its metadata in metadata files in the object storage, it is still dependent on a metastore for managing a certain amount of metadata.
+These metastores are defined as Iceberg catalogs and this extension supports connecting to the following catalog types:
+* Hive metastore catalog
+* Local catalog
+
+Support for AWS Glue and REST-based catalogs is not available yet.
+
+For a given catalog, iceberg table name and filters, the IcebergInputSource works by reading the table from the catalog, applying the filters and extracting all the underlying live data files up to the latest snapshot.
+The data files are in either Parquet, ORC or Avro formats and all of these have InputFormat support in Druid. The data files typically reside in a warehouse location which could be in HDFS, S3 or the local filesystem.
+This extension relies on the existing InputSource connectors in Druid to read the data files from the warehouse. Therefore, the IcebergInputSource can be considered as an intermediate InputSource which provides the file paths for other InputSource implementations.
+
+### Load the Iceberg Ingest extension
+
+To use the iceberg extension, add the `druid-iceberg-extensions` to the list of loaded extensions. See [Loading extensions](../../configuration/extensions.md#loading-extensions) for more information.
+
+
+### Hive Metastore catalog
+
+For Druid to seamlessly talk to the Hive Metastore, ensure that the Hive specific configuration files such as `hive-site.xml` and `core-site.xml` are available in the Druid classpath for peon processes.  
+Hive specific properties can also be specified under the `catalogProperties` object in the ingestion spec. 
+
+Hive metastore catalogs can be associated with different types of warehouses, but this extension presently only supports HDFS and S3 warehouse directories.
+
+#### Reading from HDFS warehouse 

Review Comment:
   ```suggestion
   ### Read from HDFS warehouse 
   ```




Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260498192


##########
docs/development/extensions-contrib/iceberg.md:
##########
@@ -0,0 +1,121 @@
+---
+id: iceberg 
+title: "Iceberg"
+---
+
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+
+## Iceberg Ingest Extension
+
+This extension provides [IcebergInputSource](../../ingestion/input-sources.md#iceberg-input-source) which enables ingestion of data stored in the Iceberg table format into Druid.
+
+Apache Iceberg is an open table format for huge analytic datasets. Even though iceberg manages most of its metadata in metadata files in the object storage, it is still dependent on a metastore for managing a certain amount of metadata.
+These metastores are defined as Iceberg catalogs and this extension supports connecting to the following catalog types:
+* Hive metastore catalog
+* Local catalog
+
+Support for AWS Glue and REST-based catalogs is not available yet.
+
+For a given catalog, iceberg table name and filters, the IcebergInputSource works by reading the table from the catalog, applying the filters and extracting all the underlying live data files up to the latest snapshot.
+The data files are in either Parquet, ORC or Avro formats and all of these have InputFormat support in Druid. The data files typically reside in a warehouse location which could be in HDFS, S3 or the local filesystem.
+This extension relies on the existing InputSource connectors in Druid to read the data files from the warehouse. Therefore, the IcebergInputSource can be considered as an intermediate InputSource which provides the file paths for other InputSource implementations.
+
+### Load the Iceberg Ingest extension
+
+To use the iceberg extension, add the `druid-iceberg-extensions` to the list of loaded extensions. See [Loading extensions](../../configuration/extensions.md#loading-extensions) for more information.
+
+
+### Hive Metastore catalog
+
+For Druid to seamlessly talk to the Hive Metastore, ensure that the Hive specific configuration files such as `hive-site.xml` and `core-site.xml` are available in the Druid classpath for peon processes.  
+Hive specific properties can also be specified under the `catalogProperties` object in the ingestion spec. 
+
+Hive metastore catalogs can be associated with different types of warehouses, but this extension presently only supports HDFS and S3 warehouse directories.
+
+#### Reading from HDFS warehouse 
+
+Ensure that the extension `druid-hdfs-storage` is loaded. The data file paths are extracted from the Hive metastore catalog and [HDFS input source](../../ingestion/input-sources.md#hdfs-input-source) is used to ingest these files.
+The `warehouseSource` type in the ingestion spec should be `hdfs`.
+
+If the Hive metastore supports Kerberos authentication, the following properties will be required in the `catalogProperties`:
+
+```json
+"catalogProperties": {
+  "principal": "krb_principal",
+  "keytab": "/path/to/keytab"
+}
+```
+Only Kerberos-based authentication is supported at this time.
+
+#### Reading from S3 warehouse
+
+Ensure that the extension `druid-s3-extensions` is loaded. The data file paths are extracted from the Hive metastore catalog and `S3InputSource` is used to ingest these files.
+The `warehouseSource` type in the ingestion spec should be `s3`. If the S3 endpoint for the warehouse is different from the endpoint configured as the deep storage, the following properties are required in the `warehouseSource` section to define the S3 endpoint settings:
+
+```json
+"warehouseSource": {
+  "type": "s3",
+  "endpointConfig": {
+    "url": "S3_ENDPOINT_URL",
+    "signingRegion": "us-east-1"
+  },
+  "clientConfig": {
+    "protocol": "http",
+    "disableChunkedEncoding": true,
+    "enablePathStyleAccess": true,
+    "forceGlobalBucketAccessEnabled": false
+  },
+  "properties": {
+    "accessKeyId": {
+      "type": "default",
+      "password": "<ACCESS_KEY_ID"
+    },
+    "secretAccessKey": {
+      "type": "default",
+      "password": "<SECRET_ACCESS_KEY>"
+    }
+  }
+}
+```
+
+This extension uses the [Hadoop AWS module](https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/) to connect to S3 and retrieve the metadata and data file paths.
+The following properties will be required in the `catalogProperties`:
+
+```json
+"catalogProperties": {
+  "fs.s3a.access.key" : "S3_ACCESS_KEY",
+  "fs.s3a.secret.key" : "S3_SECRET_KEY",
+  "fs.s3a.endpoint" : "S3_API_ENDPOINT"
+}
+```
+Since the Hadoop AWS connector uses the `s3a` filesystem client, the warehouse path should be specified with the `s3a://` protocol instead of `s3://`.
+
+### Local Catalog
+
+The local catalog type can be used for catalogs configured on the local filesystem. Set the `icebergCatalog` type to `local`. This catalog is useful for demos or localized tests and is not recommended for production use cases.
+This catalog only supports reading from a local filesystem, so the `warehouseSource` should be defined as `local`.
+
+### Known limitations
+
+This extension does not yet fully utilize iceberg features such as snapshotting or schema evolution. The current limitations of this extension are:
+
+- The `IcebergInputSource` reads every single live file on the iceberg table up to the latest snapshot, which makes the table scan less performant. It is recommended to use iceberg filters on partition columns in the ingestion spec in order to limit the number of data files being retrieved. Since Druid doesn't store the last ingested iceberg snapshot ID, it cannot identify the files created between that snapshot and the latest snapshot on iceberg.
+- It does not handle iceberg [schema evolution](https://iceberg.apache.org/docs/latest/evolution/) yet. In cases where an existing iceberg table column is deleted and recreated with the same name, ingesting this table into Druid may bring the data for this column before it was deleted.

Review Comment:
   ```suggestion
   - It does not handle Iceberg [schema evolution](https://iceberg.apache.org/docs/latest/evolution/) yet. In cases where an existing Iceberg table column is deleted and recreated with the same name, ingesting this table into Druid may bring the data for this column before it was deleted.
   ```




Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260529222


##########
docs/ingestion/input-sources.md:
##########
@@ -794,6 +794,194 @@ The following is an example of a Combining input source spec:
 ...
 ```
 
+## Iceberg input source
+
+> You need to include the `druid-iceberg-extensions` extension to use the Iceberg input source.
+
+The Iceberg input source is used to read data stored in the Iceberg table format. For a given table, this input source scans up to the latest iceberg snapshot from the configured Hive catalog, and the underlying live data files are ingested using the existing input source formats available in Druid.
+
+The Iceberg input source cannot be used on its own, as it relies on the existing input sources to perform the actual reads from the data files.
+For example, if the warehouse associated with an iceberg catalog is on `S3`, ensure that the [`druid-s3-extensions`](../development/extensions-core/s3.md) extension is also loaded.
+
+Sample specs:
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "iceberg",
+        "tableName": "iceberg_table",
+        "namespace": "iceberg_namespace",
+        "icebergCatalog": {
+            "type": "hive",
+            "warehousePath": "hdfs://warehouse/path",
+            "catalogUri": "thrift://hive-metastore.x.com:8970",
+            "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "catalog_test",
+                "hadoop.security.authentication": "kerberos",
+                "hadoop.security.authorization": "true"
+            }
+        },
+        "icebergFilter": {
+            "type": "interval",
+            "filterColumn": "event_time",
+            "intervals": [
+              "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+            ]
+        },
+        "warehouseSource": {
+            "type": "hdfs"
+        }
+      },
+      "inputFormat": {
+        "type": "parquet"
+      }
+  },
+      ...
+},
+...
+```
+
+```json
+...
+        "ioConfig": {
+          "type": "index_parallel",
+          "inputSource": {
+            "type": "iceberg",
+            "tableName": "iceberg_table",
+            "namespace": "iceberg_namespace",
+            "icebergCatalog": {
+              "type": "hive",
+              "warehousePath": "hdfs://warehouse/path",
+              "catalogUri": "thrift://hive-metastore.x.com:8970",
+              "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "default_catalog",
+                "fs.s3a.access.key" : "S3_ACCESS_KEY",
+                "fs.s3a.secret.key" : "S3_SECRET_KEY",
+                "fs.s3a.endpoint" : "S3_API_ENDPOINT"
+              }
+            },
+            "icebergFilter": {
+              "type": "interval",
+              "filterColumn": "event_time",
+              "intervals": [
+                "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+              ]
+            },
+            "warehouseSource": {
+              "type": "s3",
+              "endpointConfig": {
+                "url": "teststore.aws.com",
+                "signingRegion": "us-west-2a"
+              },
+              "clientConfig": {
+                "protocol": "http",
+                "disableChunkedEncoding": true,
+                "enablePathStyleAccess": true,
+                "forceGlobalBucketAccessEnabled": false
+              },
+              "properties": {
+                "accessKeyId": {
+                  "type": "default",
+                  "password": "foo"
+                },
+                "secretAccessKey": {
+                  "type": "default",
+                  "password": "bar"
+                }
+              }
+            }
+          },
+          "inputFormat": {
+            "type": "parquet"
+          }
+        },
+...
+},
+```
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set the value to `iceberg`.|yes|
+|tableName|The iceberg table name configured in the catalog.|yes|
+|namespace|The iceberg namespace associated with the table|yes|
+|icebergFilter|JSON Object used to filter data files within a snapshot when reading|no|
+|icebergCatalog|JSON Object used to define the catalog that manages the configured iceberg table|yes|
+|warehouseSource|JSON Object used to indicate which native input source needs to be used to read the data files from the warehouse|yes|
+
+Catalog Object:
+
+There are two supported catalog types: `local` and `hive`
+
+Local catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `local`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+Hive Catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `hive`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogUri|The URI associated with the hive catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+IcebergFilter Object:
+
+This input source provides filters: `and`, `equals`, `interval`, and `or`. These filters can be used to filter out data files from a snapshot, thereby reducing the number of files Druid has to ingest.
+
+`equals` Filter:
+
+|Property|Description|Required|

Review Comment:
   ```suggestion
   |**Property**|**Description**|**Required**|
   ```




Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260526833


##########
docs/ingestion/input-sources.md:
##########
@@ -794,6 +794,194 @@ The following is an example of a Combining input source spec:
 ...
 ```
 
+## Iceberg input source
+
+> You need to include the `druid-iceberg-extensions` as an extension to use the Iceberg input source.
+
+The Iceberg input source is used to read data stored in the Iceberg table format. For a given table, this input source scans up to the latest iceberg snapshot from the configured Hive catalog and the underlying live data files will be ingested using the existing input source formats available in Druid.
+
+The Iceberg input source cannot be independent as it relies on the existing input sources to perform the actual read from the Data files.
+For example, if the warehouse associated with an iceberg catalog is on `S3`, please ensure that the [`druid-s3-extensions`](../development/extensions-core/s3.md) extension is also loaded.
+
+Sample specs:
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "iceberg",
+        "tableName": "iceberg_table",
+        "namespace": "iceberg_namespace",
+        "icebergCatalog": {
+            "type": "hive",
+            "warehousePath": "hdfs://warehouse/path",
+            "catalogUri": "thrift://hive-metastore.x.com:8970",
+            "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "catalog_test",
+                "hadoop.security.authentication": "kerberos",
+                "hadoop.security.authorization": "true"
+            }
+        },
+        "icebergFilter": {
+            "type": "interval",
+            "filterColumn": "event_time",
+            "intervals": [
+              "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+            ]
+        },
+        "warehouseSource": {
+            "type": "hdfs"
+        }
+      },
+      "inputFormat": {
+        "type": "parquet"
+      }
+  },
+      ...
+},
+...
+```
+
+```json
+...
+        "ioConfig": {
+          "type": "index_parallel",
+          "inputSource": {
+            "type": "iceberg",
+            "tableName": "iceberg_table",
+            "namespace": "iceberg_namespace",
+            "icebergCatalog": {
+              "type": "hive",
+              "warehousePath": "hdfs://warehouse/path",
+              "catalogUri": "thrift://hive-metastore.x.com:8970",
+              "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "default_catalog",
+                "fs.s3a.access.key" : "S3_ACCESS_KEY",
+                "fs.s3a.secret.key" : "S3_SECRET_KEY",
+                "fs.s3a.endpoint" : "S3_API_ENDPOINT"
+              }
+            },
+            "icebergFilter": {
+              "type": "interval",
+              "filterColumn": "event_time",
+              "intervals": [
+                "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+              ]
+            },
+            "warehouseSource": {
+              "type": "s3",
+              "endpointConfig": {
+                "url": "teststore.aws.com",
+                "signingRegion": "us-west-2a"
+              },
+              "clientConfig": {
+                "protocol": "http",
+                "disableChunkedEncoding": true,
+                "enablePathStyleAccess": true,
+                "forceGlobalBucketAccessEnabled": false
+              },
+              "properties": {
+                "accessKeyId": {
+                  "type": "default",
+                  "password": "foo"
+                },
+                "secretAccessKey": {
+                  "type": "default",
+                  "password": "bar"
+                }
+              }
+            }
+          },
+          "inputFormat": {
+            "type": "parquet"
+          }
+        },
+...
+},
+```
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set the value to `iceberg`.|yes|
+|tableName|The iceberg table name configured in the catalog.|yes|
+|namespace|The iceberg namespace associated with the table|yes|
+|icebergFilter|JSON Object used to filter data files within a snapshot when reading|no|
+|icebergCatalog|JSON Object used to define the catalog that manages the configured iceberg table|yes|
+|warehouseSource|JSON Object used to indicate which native input source needs to be used to read the data files from the warehouse|yes|
+
+Catalog Object:
+
+There are two supported catalog types: `local` and `hive`
+
+Local catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `local`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+Hive Catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `hive`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|

Review Comment:
   ```suggestion
   |`warehousePath`|The location of the warehouse associated with the catalog.|Yes|
   ```




Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260525774


##########
docs/ingestion/input-sources.md:
##########
@@ -794,6 +794,194 @@ The following is an example of a Combining input source spec:
 ...
 ```
 
+## Iceberg input source
+
+> You need to include the `druid-iceberg-extensions` as an extension to use the Iceberg input source.
+
+The Iceberg input source reads data stored in the Iceberg table format. For a given table, this input source scans up to the latest Iceberg snapshot from the configured catalog, and the underlying live data files are ingested using the existing input formats available in Druid.
+
+The Iceberg input source cannot be used on its own; it relies on the existing input sources to perform the actual reads from the data files.
+For example, if the warehouse associated with an Iceberg catalog is on S3, ensure that the [`druid-s3-extensions`](../development/extensions-core/s3.md) extension is also loaded.
+
+Sample specs:
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "iceberg",
+        "tableName": "iceberg_table",
+        "namespace": "iceberg_namespace",
+        "icebergCatalog": {
+            "type": "hive",
+            "warehousePath": "hdfs://warehouse/path",
+            "catalogUri": "thrift://hive-metastore.x.com:8970",
+            "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "catalog_test",
+                "hadoop.security.authentication": "kerberos",
+                "hadoop.security.authorization": "true"
+            }
+        },
+        "icebergFilter": {
+            "type": "interval",
+            "filterColumn": "event_time",
+            "intervals": [
+              "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+            ]
+        },
+        "warehouseSource": {
+            "type": "hdfs"
+        }
+      },
+      "inputFormat": {
+        "type": "parquet"
+      }
+  },
+      ...
+},
+...
+```
+
+```json
+...
+        "ioConfig": {
+          "type": "index_parallel",
+          "inputSource": {
+            "type": "iceberg",
+            "tableName": "iceberg_table",
+            "namespace": "iceberg_namespace",
+            "icebergCatalog": {
+              "type": "hive",
+              "warehousePath": "hdfs://warehouse/path",
+              "catalogUri": "thrift://hive-metastore.x.com:8970",
+              "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "default_catalog",
+                "fs.s3a.access.key" : "S3_ACCESS_KEY",
+                "fs.s3a.secret.key" : "S3_SECRET_KEY",
+                "fs.s3a.endpoint" : "S3_API_ENDPOINT"
+              }
+            },
+            "icebergFilter": {
+              "type": "interval",
+              "filterColumn": "event_time",
+              "intervals": [
+                "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+              ]
+            },
+            "warehouseSource": {
+              "type": "s3",
+              "endpointConfig": {
+                "url": "teststore.aws.com",
+                "signingRegion": "us-west-2a"
+              },
+              "clientConfig": {
+                "protocol": "http",
+                "disableChunkedEncoding": true,
+                "enablePathStyleAccess": true,
+                "forceGlobalBucketAccessEnabled": false
+              },
+              "properties": {
+                "accessKeyId": {
+                  "type": "default",
+                  "password": "foo"
+                },
+                "secretAccessKey": {
+                  "type": "default",
+                  "password": "bar"
+                }
+              }
+            }
+          },
+          "inputFormat": {
+            "type": "parquet"
+          }
+        },
+...
+},
+```
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set the value to `iceberg`.|yes|
+|tableName|The iceberg table name configured in the catalog.|yes|
+|namespace|The iceberg namespace associated with the table|yes|
+|icebergFilter|JSON Object used to filter data files within a snapshot when reading|no|
+|icebergCatalog|JSON Object used to define the catalog that manages the configured iceberg table|yes|
+|warehouseSource|JSON Object used to indicate which native input source needs to be used to read the data files from the warehouse|yes|
+
+Catalog Object:
+
+There are two supported catalog types: `local` and `hive`
+
+Local catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `local`.|yes|

Review Comment:
   ```suggestion
   |`type`|Set this value to `local`.|Yes|
   ```





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260498315


##########
docs/development/extensions-contrib/iceberg.md:
##########
@@ -0,0 +1,121 @@
+---
+id: iceberg 
+title: "Iceberg"
+---
+
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+
+## Iceberg Ingest Extension
+
+This extension provides [IcebergInputSource](../../ingestion/input-sources.md#iceberg-input-source) which enables ingestion of data stored in the Iceberg table format into Druid.
+
+Apache Iceberg is an open table format for huge analytic datasets. Although Iceberg keeps most of its metadata in metadata files in object storage, it still depends on a metastore to manage a certain amount of metadata.
+These metastores are defined as Iceberg catalogs, and this extension supports connecting to the following catalog types:
+* Hive metastore catalog
+* Local catalog
+
+Support for AWS Glue and REST-based catalogs is not available yet.
+
+Given a catalog, an Iceberg table name, and optional filters, the IcebergInputSource reads the table from the catalog, applies the filters, and extracts all the underlying live data files up to the latest snapshot.
+The data files are in Parquet, ORC, or Avro format, all of which have InputFormat support in Druid. The data files typically reside in a warehouse location, which can be in HDFS, S3, or the local filesystem.
+This extension relies on the existing InputSource connectors in Druid to read the data files from the warehouse. The IcebergInputSource can therefore be considered an intermediate InputSource that provides the file paths for other InputSource implementations.
+
+### Load the Iceberg Ingest extension
+
+To use the Iceberg extension, add `druid-iceberg-extensions` to the list of loaded extensions. See [Loading extensions](../../configuration/extensions.md#loading-extensions) for more information.
+
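+For example, to load this extension together with the HDFS storage and Parquet extensions, the load list in `common.runtime.properties` might look like the following (an illustrative list; include only the extensions that your warehouse and file format require):
+
+```
+druid.extensions.loadList=["druid-iceberg-extensions", "druid-hdfs-storage", "druid-parquet-extensions"]
+```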
+
+### Hive Metastore catalog
+
+For Druid to talk to the Hive metastore, ensure that Hive-specific configuration files such as `hive-site.xml` and `core-site.xml` are available on the Druid classpath for the peon processes.
+Hive-specific properties can also be specified under the `catalogProperties` object in the ingestion spec.
+
+Hive metastore catalogs can be associated with different types of warehouses, but this extension presently supports only HDFS and S3 warehouse directories.
+
+#### Reading from HDFS warehouse 
+
+Ensure that the `druid-hdfs-storage` extension is loaded. The data file paths are extracted from the Hive metastore catalog, and the [HDFS input source](../../ingestion/input-sources.md#hdfs-input-source) is used to ingest these files.
+The `warehouseSource` type in the ingestion spec should be `hdfs`.
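+
+For example, the corresponding section of the ingestion spec can be as simple as:
+
+```json
+"warehouseSource": {
+  "type": "hdfs"
+}
+```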
+
+If the Hive metastore uses Kerberos authentication, the following properties are required in the `catalogProperties`:
+
+```json
+"catalogProperties": {
+  "principal": "krb_principal",
+  "keytab": "/path/to/keytab"
+}
+```
+Only Kerberos-based authentication is supported at this time.
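+
+As a sketch, a `catalogProperties` object for a Kerberized metastore might combine the keys above with the Hive and Hadoop security properties shown in the sample ingestion specs; the exact set of properties depends on your cluster:
+
+```json
+"catalogProperties": {
+  "principal": "krb_principal",
+  "keytab": "/path/to/keytab",
+  "hive.metastore.sasl.enabled": "true",
+  "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+  "hadoop.security.authentication": "kerberos",
+  "hadoop.security.authorization": "true"
+}
+```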
+
+#### Reading from S3 warehouse
+
+Ensure that the `druid-s3-extensions` extension is loaded. The data file paths are extracted from the Hive metastore catalog, and the `S3InputSource` is used to ingest these files.
+The `warehouseSource` type in the ingestion spec should be `s3`. If the S3 endpoint for the warehouse differs from the endpoint configured for deep storage, the following properties are required in the `warehouseSource` section to define the S3 endpoint settings:
+
+```json
+"warehouseSource": {
+  "type": "s3",
+  "endpointConfig": {
+    "url": "S3_ENDPOINT_URL",
+    "signingRegion": "us-east-1"
+  },
+  "clientConfig": {
+    "protocol": "http",
+    "disableChunkedEncoding": true,
+    "enablePathStyleAccess": true,
+    "forceGlobalBucketAccessEnabled": false
+  },
+  "properties": {
+    "accessKeyId": {
+      "type": "default",
+      "password": "<ACCESS_KEY_ID>"
+    },
+    "secretAccessKey": {
+      "type": "default",
+      "password": "<SECRET_ACCESS_KEY>"
+    }
+  }
+}
+```
+
+This extension uses the [Hadoop AWS module](https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/) to connect to S3 and retrieve the metadata and data file paths.
+The following properties are required in the `catalogProperties`:
+
+```json
+"catalogProperties": {
+  "fs.s3a.access.key" : "S3_ACCESS_KEY",
+  "fs.s3a.secret.key" : "S3_SECRET_KEY",
+  "fs.s3a.endpoint" : "S3_API_ENDPOINT"
+}
+```
+Since the Hadoop AWS connector uses the `s3a` filesystem client, specify the warehouse path with the `s3a://` protocol instead of `s3://`.
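+
+For example, a warehouse path that might otherwise be written as `s3://my-bucket/warehouse` (the bucket name is illustrative) should be configured as:
+
+```json
+"warehousePath": "s3a://my-bucket/warehouse"
+```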
+
+### Local Catalog
+
+The local catalog type can be used for catalogs configured on the local filesystem. Set the `icebergCatalog` type to `local`. This catalog is useful for demos or localized tests and is not recommended for production use cases.
+This catalog only supports reading from a local filesystem, so the `warehouseSource` is defined as `local`.
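+
+A minimal sketch of the relevant input source sections for a local catalog might look like the following (the warehouse path is illustrative, and the local warehouse source is shown without additional properties):
+
+```json
+"icebergCatalog": {
+  "type": "local",
+  "warehousePath": "/tmp/iceberg_warehouse"
+},
+"warehouseSource": {
+  "type": "local"
+}
+```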
+
+### Known limitations
+
+This extension does not yet fully utilize Iceberg features such as snapshotting or schema evolution. The current limitations of this extension are:
+
+- The `IcebergInputSource` reads every live file in the Iceberg table up to the latest snapshot, which makes the table scan less performant. It is recommended to use Iceberg filters on partition columns in the ingestion spec to limit the number of data files retrieved. Because Druid doesn't store the last ingested Iceberg snapshot ID, it cannot identify only the files created between that snapshot and the latest Iceberg snapshot.
+- It does not handle Iceberg [schema evolution](https://iceberg.apache.org/docs/latest/evolution/) yet. If an existing Iceberg table column is deleted and recreated with the same name, ingesting this table into Druid may bring in data written to that column before it was deleted.
+- The Hive catalog has not been tested on Hadoop 2.x.x and therefore is not guaranteed to work with Hadoop 2.

Review Comment:
   ```suggestion
   - The Hive catalog has not been tested on Hadoop 2.x.x and is not guaranteed to work with Hadoop 2.
   ```





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260504487


##########
docs/ingestion/input-sources.md:
##########
@@ -794,6 +794,194 @@ The following is an example of a Combining input source spec:
 ...
 ```
 
+## Iceberg input source
+
+> You need to include the `druid-iceberg-extensions` as an extension to use the Iceberg input source.
+
+The Iceberg input source reads data stored in the Iceberg table format. For a given table, this input source scans up to the latest Iceberg snapshot from the configured catalog, and the underlying live data files are ingested using the existing input formats available in Druid.
+
+The Iceberg input source cannot be used on its own; it relies on the existing input sources to perform the actual reads from the data files.
+For example, if the warehouse associated with an Iceberg catalog is on S3, ensure that the [`druid-s3-extensions`](../development/extensions-core/s3.md) extension is also loaded.
+
+Sample specs:
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "iceberg",
+        "tableName": "iceberg_table",
+        "namespace": "iceberg_namespace",
+        "icebergCatalog": {
+            "type": "hive",
+            "warehousePath": "hdfs://warehouse/path",
+            "catalogUri": "thrift://hive-metastore.x.com:8970",
+            "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "catalog_test",
+                "hadoop.security.authentication": "kerberos",
+                "hadoop.security.authorization": "true"
+            }
+        },
+        "icebergFilter": {
+            "type": "interval",
+            "filterColumn": "event_time",
+            "intervals": [
+              "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+            ]
+        },
+        "warehouseSource": {
+            "type": "hdfs"
+        }
+      },
+      "inputFormat": {
+        "type": "parquet"
+      }
+  },
+      ...
+},
+...
+```
+
+```json
+...
+        "ioConfig": {
+          "type": "index_parallel",
+          "inputSource": {
+            "type": "iceberg",
+            "tableName": "iceberg_table",
+            "namespace": "iceberg_namespace",
+            "icebergCatalog": {
+              "type": "hive",
+              "warehousePath": "hdfs://warehouse/path",
+              "catalogUri": "thrift://hive-metastore.x.com:8970",
+              "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "default_catalog",
+                "fs.s3a.access.key" : "S3_ACCESS_KEY",
+                "fs.s3a.secret.key" : "S3_SECRET_KEY",
+                "fs.s3a.endpoint" : "S3_API_ENDPOINT"
+              }
+            },
+            "icebergFilter": {
+              "type": "interval",
+              "filterColumn": "event_time",
+              "intervals": [
+                "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+              ]
+            },
+            "warehouseSource": {
+              "type": "s3",
+              "endpointConfig": {
+                "url": "teststore.aws.com",
+                "signingRegion": "us-west-2a"
+              },
+              "clientConfig": {
+                "protocol": "http",
+                "disableChunkedEncoding": true,
+                "enablePathStyleAccess": true,
+                "forceGlobalBucketAccessEnabled": false
+              },
+              "properties": {
+                "accessKeyId": {
+                  "type": "default",
+                  "password": "foo"
+                },
+                "secretAccessKey": {
+                  "type": "default",
+                  "password": "bar"
+                }
+              }
+            }
+          },
+          "inputFormat": {
+            "type": "parquet"
+          }
+        },
+...
+},
+```
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set the value to `iceberg`.|yes|
+|tableName|The iceberg table name configured in the catalog.|yes|
+|namespace|The iceberg namespace associated with the table|yes|
+|icebergFilter|JSON Object used to filter data files within a snapshot when reading|no|

Review Comment:
   ```suggestion
   |`icebergFilter`|The JSON object used to filter data files within a snapshot.|No|
   ```





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260535259


##########
docs/ingestion/input-sources.md:
##########
@@ -794,6 +794,194 @@ The following is an example of a Combining input source spec:
 ...
 ```
 
+## Iceberg input source
+
+> You need to include the `druid-iceberg-extensions` as an extension to use the Iceberg input source.
+
+The Iceberg input source reads data stored in the Iceberg table format. For a given table, this input source scans up to the latest Iceberg snapshot from the configured catalog, and the underlying live data files are ingested using the existing input formats available in Druid.
+
+The Iceberg input source cannot be used on its own; it relies on the existing input sources to perform the actual reads from the data files.
+For example, if the warehouse associated with an Iceberg catalog is on S3, ensure that the [`druid-s3-extensions`](../development/extensions-core/s3.md) extension is also loaded.
+
+Sample specs:
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "iceberg",
+        "tableName": "iceberg_table",
+        "namespace": "iceberg_namespace",
+        "icebergCatalog": {
+            "type": "hive",
+            "warehousePath": "hdfs://warehouse/path",
+            "catalogUri": "thrift://hive-metastore.x.com:8970",
+            "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "catalog_test",
+                "hadoop.security.authentication": "kerberos",
+                "hadoop.security.authorization": "true"
+            }
+        },
+        "icebergFilter": {
+            "type": "interval",
+            "filterColumn": "event_time",
+            "intervals": [
+              "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+            ]
+        },
+        "warehouseSource": {
+            "type": "hdfs"
+        }
+      },
+      "inputFormat": {
+        "type": "parquet"
+      }
+  },
+      ...
+},
+...
+```
+
+```json
+...
+        "ioConfig": {
+          "type": "index_parallel",
+          "inputSource": {
+            "type": "iceberg",
+            "tableName": "iceberg_table",
+            "namespace": "iceberg_namespace",
+            "icebergCatalog": {
+              "type": "hive",
+              "warehousePath": "hdfs://warehouse/path",
+              "catalogUri": "thrift://hive-metastore.x.com:8970",
+              "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "default_catalog",
+                "fs.s3a.access.key" : "S3_ACCESS_KEY",
+                "fs.s3a.secret.key" : "S3_SECRET_KEY",
+                "fs.s3a.endpoint" : "S3_API_ENDPOINT"
+              }
+            },
+            "icebergFilter": {
+              "type": "interval",
+              "filterColumn": "event_time",
+              "intervals": [
+                "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+              ]
+            },
+            "warehouseSource": {
+              "type": "s3",
+              "endpointConfig": {
+                "url": "teststore.aws.com",
+                "signingRegion": "us-west-2a"
+              },
+              "clientConfig": {
+                "protocol": "http",
+                "disableChunkedEncoding": true,
+                "enablePathStyleAccess": true,
+                "forceGlobalBucketAccessEnabled": false
+              },
+              "properties": {
+                "accessKeyId": {
+                  "type": "default",
+                  "password": "foo"
+                },
+                "secretAccessKey": {
+                  "type": "default",
+                  "password": "bar"
+                }
+              }
+            }
+          },
+          "inputFormat": {
+            "type": "parquet"
+          }
+        },
+...
+},
+```
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set the value to `iceberg`.|yes|
+|tableName|The iceberg table name configured in the catalog.|yes|
+|namespace|The iceberg namespace associated with the table|yes|
+|icebergFilter|JSON Object used to filter data files within a snapshot when reading|no|
+|icebergCatalog|JSON Object used to define the catalog that manages the configured iceberg table|yes|
+|warehouseSource|JSON Object used to indicate which native input source needs to be used to read the data files from the warehouse|yes|
+
+Catalog Object:
+
+There are two supported catalog types: `local` and `hive`
+
+Local catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `local`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+Hive Catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `hive`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogUri|The URI associated with the hive catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+IcebergFilter Object:
+
+This input source provides the following filters: `and`, `equals`, `interval`, and `or`. These filters can be used to filter out data files from a snapshot, thereby reducing the number of files Druid has to ingest.
+
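+As an illustrative sketch, filters can be nested; for example, an `and` filter that combines an `equals` filter with an `interval` filter (the column names and values are hypothetical) could look like this:
+
+```json
+"icebergFilter": {
+  "type": "and",
+  "filters": [
+    {
+      "type": "equals",
+      "filterColumn": "region",
+      "filterValue": "us-west"
+    },
+    {
+      "type": "interval",
+      "filterColumn": "event_time",
+      "intervals": ["2023-05-10T00:00:00.000Z/2023-05-11T00:00:00.000Z"]
+    }
+  ]
+}
+```
+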
+`equals` Filter:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `equals`.|yes|
+|filterColumn|The column name from the iceberg table schema based on which filtering needs to happen|yes|
+|filterValue|The value to filter on|yes|
+
+`interval` Filter:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `interval`.|yes|
+|filterColumn|The column name from the iceberg table schema based on which filtering needs to happen|yes|
+|intervals|A JSON array containing ISO-8601 interval strings. This defines the time ranges to filter on. The start interval is inclusive and the end interval is exclusive. |yes|

Review Comment:
   ```suggestion
   |`intervals`|A JSON array containing ISO 8601 interval strings. This defines the time ranges to filter on. The start interval is inclusive and the end interval is exclusive. |Yes|
   ```





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260530944


##########
docs/ingestion/input-sources.md:
##########
@@ -794,6 +794,194 @@ The following is an example of a Combining input source spec:
 ...
 ```
 
+## Iceberg input source
+
+> You need to include the `druid-iceberg-extensions` as an extension to use the Iceberg input source.
+
+The Iceberg input source reads data stored in the Iceberg table format. For a given table, this input source scans up to the latest Iceberg snapshot from the configured catalog, and the underlying live data files are ingested using the existing input formats available in Druid.
+
+The Iceberg input source cannot be used on its own; it relies on the existing input sources to perform the actual reads from the data files.
+For example, if the warehouse associated with an Iceberg catalog is on S3, ensure that the [`druid-s3-extensions`](../development/extensions-core/s3.md) extension is also loaded.
+
+Sample specs:
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "iceberg",
+        "tableName": "iceberg_table",
+        "namespace": "iceberg_namespace",
+        "icebergCatalog": {
+            "type": "hive",
+            "warehousePath": "hdfs://warehouse/path",
+            "catalogUri": "thrift://hive-metastore.x.com:8970",
+            "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "catalog_test",
+                "hadoop.security.authentication": "kerberos",
+                "hadoop.security.authorization": "true"
+            }
+        },
+        "icebergFilter": {
+            "type": "interval",
+            "filterColumn": "event_time",
+            "intervals": [
+              "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+            ]
+        },
+        "warehouseSource": {
+            "type": "hdfs"
+        }
+      },
+      "inputFormat": {
+        "type": "parquet"
+      }
+  },
+      ...
+},
+...
+```
+
+```json
+...
+        "ioConfig": {
+          "type": "index_parallel",
+          "inputSource": {
+            "type": "iceberg",
+            "tableName": "iceberg_table",
+            "namespace": "iceberg_namespace",
+            "icebergCatalog": {
+              "type": "hive",
+              "warehousePath": "hdfs://warehouse/path",
+              "catalogUri": "thrift://hive-metastore.x.com:8970",
+              "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "default_catalog",
+                "fs.s3a.access.key" : "S3_ACCESS_KEY",
+                "fs.s3a.secret.key" : "S3_SECRET_KEY",
+                "fs.s3a.endpoint" : "S3_API_ENDPOINT"
+              }
+            },
+            "icebergFilter": {
+              "type": "interval",
+              "filterColumn": "event_time",
+              "intervals": [
+                "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+              ]
+            },
+            "warehouseSource": {
+              "type": "s3",
+              "endpointConfig": {
+                "url": "teststore.aws.com",
+                "signingRegion": "us-west-2a"
+              },
+              "clientConfig": {
+                "protocol": "http",
+                "disableChunkedEncoding": true,
+                "enablePathStyleAccess": true,
+                "forceGlobalBucketAccessEnabled": false
+              },
+              "properties": {
+                "accessKeyId": {
+                  "type": "default",
+                  "password": "foo"
+                },
+                "secretAccessKey": {
+                  "type": "default",
+                  "password": "bar"
+                }
+              }
+            }
+          },
+          "inputFormat": {
+            "type": "parquet"
+          }
+        },
+...
+},
+```
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set the value to `iceberg`.|yes|
+|tableName|The iceberg table name configured in the catalog.|yes|
+|namespace|The iceberg namespace associated with the table|yes|
+|icebergFilter|JSON Object used to filter data files within a snapshot when reading|no|
+|icebergCatalog|JSON Object used to define the catalog that manages the configured iceberg table|yes|
+|warehouseSource|JSON Object used to indicate which native input source needs to be used to read the data files from the warehouse|yes|
+
+Catalog Object:
+
+There are two supported catalog types: `local` and `hive`
+
+Local catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `local`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+Hive Catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `hive`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogUri|The URI associated with the hive catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+IcebergFilter Object:
+
+This input source provides the following filters: `and`, `equals`, `interval`, and `or`. These filters can be used to filter out data files from a snapshot, thereby reducing the number of files Druid has to ingest.
+
+`equals` Filter:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `equals`.|yes|
+|filterColumn|The column name from the iceberg table schema based on which filtering needs to happen|yes|

Review Comment:
   ```suggestion
   |`filterColumn`|The name of the column from the Iceberg table schema to use for filtering.|Yes|
   ```





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260472878


##########
docs/development/extensions-contrib/iceberg.md:
##########
@@ -0,0 +1,121 @@
+---
+id: iceberg 
+title: "Iceberg"
+---
+
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+
+## Iceberg Ingest Extension
+
+This extension provides [IcebergInputSource](../../ingestion/input-sources.md#iceberg-input-source) which enables ingestion of data stored in the Iceberg table format into Druid.
+
+Apache Iceberg is an open table format for huge analytic datasets. Even though iceberg manages most of its metadata in metadata files in the object storage, it is still dependent on a metastore for managing a certain amount of metadata.
+These metastores are defined as Iceberg catalogs and this extension supports connecting to the following catalog types:
+* Hive metastore catalog
+* Local catalog
+
+Support for AWS Glue and REST based catalogs are not available yet.
+
+For a given catalog, iceberg table name and filters, the IcebergInputSource works by reading the table from the catalog, applying the filters and extracting all the underlying live data files up to the latest snapshot.
+The data files are in either Parquet, ORC or Avro formats and all of these have InputFormat support in Druid. The data files typically reside in a warehouse location which could be in HDFS, S3 or the local filesystem.
+This extension relies on the existing InputSource connectors in Druid to read the data files from the warehouse. Therefore, the IcebergInputSource can be considered as an intermediate InputSource which provides the file paths for other InputSource implementations.
+
+### Load the Iceberg Ingest extension
+
+To use the iceberg extension, add the `druid-iceberg-extensions` to the list of loaded extensions. See [Loading extensions](../../configuration/extensions.md#loading-extensions) for more information.
+
+
+### Hive Metastore catalog

Review Comment:
   ```suggestion
   ## Hive metastore catalog
   ```





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260534502


##########
docs/ingestion/input-sources.md:
##########
@@ -794,6 +794,194 @@ The following is an example of a Combining input source spec:
 ...
 ```
 
+## Iceberg input source
+
+> You need to include the `druid-iceberg-extensions` as an extension to use the Iceberg input source.
+
+The Iceberg input source reads data stored in the Iceberg table format. For a given table, this input source scans up to the latest Iceberg snapshot from the configured catalog, and the underlying live data files are ingested using the existing input formats available in Druid.
+
+The Iceberg input source cannot be used on its own; it relies on the existing input sources to perform the actual reads from the data files.
+For example, if the warehouse associated with an Iceberg catalog is on S3, ensure that the [`druid-s3-extensions`](../development/extensions-core/s3.md) extension is also loaded.
+
+Sample specs:
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "iceberg",
+        "tableName": "iceberg_table",
+        "namespace": "iceberg_namespace",
+        "icebergCatalog": {
+            "type": "hive",
+            "warehousePath": "hdfs://warehouse/path",
+            "catalogUri": "thrift://hive-metastore.x.com:8970",
+            "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "catalog_test",
+                "hadoop.security.authentication": "kerberos",
+                "hadoop.security.authorization": "true"
+            }
+        },
+        "icebergFilter": {
+            "type": "interval",
+            "filterColumn": "event_time",
+            "intervals": [
+              "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+            ]
+        },
+        "warehouseSource": {
+            "type": "hdfs"
+        }
+      },
+      "inputFormat": {
+        "type": "parquet"
+      }
+  },
+      ...
+},
+...
+```
+
+```json
+...
+        "ioConfig": {
+          "type": "index_parallel",
+          "inputSource": {
+            "type": "iceberg",
+            "tableName": "iceberg_table",
+            "namespace": "iceberg_namespace",
+            "icebergCatalog": {
+              "type": "hive",
+              "warehousePath": "hdfs://warehouse/path",
+              "catalogUri": "thrift://hive-metastore.x.com:8970",
+              "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "default_catalog",
+                "fs.s3a.access.key" : "S3_ACCESS_KEY",
+                "fs.s3a.secret.key" : "S3_SECRET_KEY",
+                "fs.s3a.endpoint" : "S3_API_ENDPOINT"
+              }
+            },
+            "icebergFilter": {
+              "type": "interval",
+              "filterColumn": "event_time",
+              "intervals": [
+                "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+              ]
+            },
+            "warehouseSource": {
+              "type": "s3",
+              "endpointConfig": {
+                "url": "teststore.aws.com",
+                "signingRegion": "us-west-2a"
+              },
+              "clientConfig": {
+                "protocol": "http",
+                "disableChunkedEncoding": true,
+                "enablePathStyleAccess": true,
+                "forceGlobalBucketAccessEnabled": false
+              },
+              "properties": {
+                "accessKeyId": {
+                  "type": "default",
+                  "password": "foo"
+                },
+                "secretAccessKey": {
+                  "type": "default",
+                  "password": "bar"
+                }
+              }
+            }
+          },
+          "inputFormat": {
+            "type": "parquet"
+          }
+        },
+...
+},
+```
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set the value to `iceberg`.|yes|
+|tableName|The iceberg table name configured in the catalog.|yes|
+|namespace|The iceberg namespace associated with the table|yes|
+|icebergFilter|JSON Object used to filter data files within a snapshot when reading|no|
+|icebergCatalog|JSON Object used to define the catalog that manages the configured iceberg table|yes|
+|warehouseSource|JSON Object used to indicate which native input source needs to be used to read the data files from the warehouse|yes|
+
+Catalog Object:
+
+There are two supported catalog types: `local` and `hive`
+
+Local catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `local`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+Hive Catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `hive`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogUri|The URI associated with the hive catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+IcebergFilter Object:
+
+This input source provides the following filters: `and`, `equals`, `interval`, and `or`. These filters can be used to filter out data files from a snapshot, thereby reducing the number of files Druid has to ingest.
+
+`equals` Filter:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `equals`.|yes|
+|filterColumn|The column name from the iceberg table schema based on which filtering needs to happen|yes|
+|filterValue|The value to filter on|yes|
+
+`interval` Filter:
+
+|Property|Description|Required|

Review Comment:
   ```suggestion
   |**Property**|**Description**|**Required**|
   ```



##########
docs/ingestion/input-sources.md:
##########
@@ -794,6 +794,194 @@ The following is an example of a Combining input source spec:
 ...
 ```
 
+## Iceberg input source
+
+> You need to include the `druid-iceberg-extensions` as an extension to use the Iceberg input source.
+
+The Iceberg input source reads data stored in the Iceberg table format. For a given table, this input source scans up to the latest Iceberg snapshot from the configured catalog, and the underlying live data files are ingested using the existing input formats available in Druid.
+
+The Iceberg input source cannot be used on its own; it relies on the existing input sources to perform the actual reads from the data files.
+For example, if the warehouse associated with an Iceberg catalog is on S3, ensure that the [`druid-s3-extensions`](../development/extensions-core/s3.md) extension is also loaded.
+
+Sample specs:
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "iceberg",
+        "tableName": "iceberg_table",
+        "namespace": "iceberg_namespace",
+        "icebergCatalog": {
+            "type": "hive",
+            "warehousePath": "hdfs://warehouse/path",
+            "catalogUri": "thrift://hive-metastore.x.com:8970",
+            "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "catalog_test",
+                "hadoop.security.authentication": "kerberos",
+                "hadoop.security.authorization": "true"
+            }
+        },
+        "icebergFilter": {
+            "type": "interval",
+            "filterColumn": "event_time",
+            "intervals": [
+              "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+            ]
+        },
+        "warehouseSource": {
+            "type": "hdfs"
+        }
+      },
+      "inputFormat": {
+        "type": "parquet"
+      }
+  },
+      ...
+},
+...
+```
+
+```json
+...
+        "ioConfig": {
+          "type": "index_parallel",
+          "inputSource": {
+            "type": "iceberg",
+            "tableName": "iceberg_table",
+            "namespace": "iceberg_namespace",
+            "icebergCatalog": {
+              "type": "hive",
+              "warehousePath": "hdfs://warehouse/path",
+              "catalogUri": "thrift://hive-metastore.x.com:8970",
+              "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "default_catalog",
+                "fs.s3a.access.key" : "S3_ACCESS_KEY",
+                "fs.s3a.secret.key" : "S3_SECRET_KEY",
+                "fs.s3a.endpoint" : "S3_API_ENDPOINT"
+              }
+            },
+            "icebergFilter": {
+              "type": "interval",
+              "filterColumn": "event_time",
+              "intervals": [
+                "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+              ]
+            },
+            "warehouseSource": {
+              "type": "s3",
+              "endpointConfig": {
+                "url": "teststore.aws.com",
+                "signingRegion": "us-west-2a"
+              },
+              "clientConfig": {
+                "protocol": "http",
+                "disableChunkedEncoding": true,
+                "enablePathStyleAccess": true,
+                "forceGlobalBucketAccessEnabled": false
+              },
+              "properties": {
+                "accessKeyId": {
+                  "type": "default",
+                  "password": "foo"
+                },
+                "secretAccessKey": {
+                  "type": "default",
+                  "password": "bar"
+                }
+              }
+            }
+          },
+          "inputFormat": {
+            "type": "parquet"
+          }
+        },
+...
+},
+```
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set the value to `iceberg`.|yes|
+|tableName|The iceberg table name configured in the catalog.|yes|
+|namespace|The iceberg namespace associated with the table|yes|
+|icebergFilter|JSON Object used to filter data files within a snapshot when reading|no|
+|icebergCatalog|JSON Object used to define the catalog that manages the configured iceberg table|yes|
+|warehouseSource|JSON Object used to indicate which native input source needs to be used to read the data files from the warehouse|yes|
+
+Catalog Object:
+
+There are two supported catalog types: `local` and `hive`
+
+Local catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `local`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+Hive Catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `hive`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogUri|The URI associated with the hive catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+IcebergFilter Object:
+
+This input source provides the following filters: `and`, `equals`, `interval`, and `or`. These filters can be used to filter out data files from a snapshot, thereby reducing the number of files Druid has to ingest.
+
+`equals` Filter:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `equals`.|yes|
+|filterColumn|The column name from the iceberg table schema based on which filtering needs to happen|yes|
+|filterValue|The value to filter on|yes|
+
+`interval` Filter:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `interval`.|yes|

Review Comment:
   ```suggestion
   |`type`|Set this value to `interval`.|Yes|
   ```





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260538383


##########
docs/ingestion/input-sources.md:
##########
@@ -794,6 +794,194 @@ The following is an example of a Combining input source spec:
 ...
 ```
 
+## Iceberg input source
+
+> You need to include the `druid-iceberg-extensions` as an extension to use the Iceberg input source.
+
+The Iceberg input source reads data stored in the Iceberg table format. For a given table, this input source scans up to the latest Iceberg snapshot from the configured catalog, and the underlying live data files are ingested using the existing input formats available in Druid.
+
+The Iceberg input source cannot be used on its own; it relies on the existing input sources to perform the actual reads from the data files.
+For example, if the warehouse associated with an Iceberg catalog is on S3, ensure that the [`druid-s3-extensions`](../development/extensions-core/s3.md) extension is also loaded.
+
+Sample specs:
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "iceberg",
+        "tableName": "iceberg_table",
+        "namespace": "iceberg_namespace",
+        "icebergCatalog": {
+            "type": "hive",
+            "warehousePath": "hdfs://warehouse/path",
+            "catalogUri": "thrift://hive-metastore.x.com:8970",
+            "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "catalog_test",
+                "hadoop.security.authentication": "kerberos",
+                "hadoop.security.authorization": "true"
+            }
+        },
+        "icebergFilter": {
+            "type": "interval",
+            "filterColumn": "event_time",
+            "intervals": [
+              "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+            ]
+        },
+        "warehouseSource": {
+            "type": "hdfs"
+        }
+      },
+      "inputFormat": {
+        "type": "parquet"
+      }
+  },
+      ...
+},
+...
+```
+
+```json
+...
+        "ioConfig": {
+          "type": "index_parallel",
+          "inputSource": {
+            "type": "iceberg",
+            "tableName": "iceberg_table",
+            "namespace": "iceberg_namespace",
+            "icebergCatalog": {
+              "type": "hive",
+              "warehousePath": "hdfs://warehouse/path",
+              "catalogUri": "thrift://hive-metastore.x.com:8970",
+              "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "default_catalog",
+                "fs.s3a.access.key" : "S3_ACCESS_KEY",
+                "fs.s3a.secret.key" : "S3_SECRET_KEY",
+                "fs.s3a.endpoint" : "S3_API_ENDPOINT"
+              }
+            },
+            "icebergFilter": {
+              "type": "interval",
+              "filterColumn": "event_time",
+              "intervals": [
+                "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+              ]
+            },
+            "warehouseSource": {
+              "type": "s3",
+              "endpointConfig": {
+                "url": "teststore.aws.com",
+                "signingRegion": "us-west-2a"
+              },
+              "clientConfig": {
+                "protocol": "http",
+                "disableChunkedEncoding": true,
+                "enablePathStyleAccess": true,
+                "forceGlobalBucketAccessEnabled": false
+              },
+              "properties": {
+                "accessKeyId": {
+                  "type": "default",
+                  "password": "foo"
+                },
+                "secretAccessKey": {
+                  "type": "default",
+                  "password": "bar"
+                }
+              }
+            }
+          },
+          "inputFormat": {
+            "type": "parquet"
+          }
+        },
+...
+},
+```
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set the value to `iceberg`.|yes|
+|tableName|The iceberg table name configured in the catalog.|yes|
+|namespace|The iceberg namespace associated with the table|yes|
+|icebergFilter|JSON Object used to filter data files within a snapshot when reading|no|
+|icebergCatalog|JSON Object used to define the catalog that manages the configured iceberg table|yes|
+|warehouseSource|JSON Object used to indicate which native input source needs to be used to read the data files from the warehouse|yes|
+
+Catalog Object:
+
+There are two supported catalog types: `local` and `hive`
+
+Local catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `local`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+Hive Catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `hive`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogUri|The URI associated with the hive catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+IcebergFilter Object:
+
+This input source provides the following filters: `and`, `equals`, `interval`, and `or`. These filters can be used to filter out data files from a snapshot, thereby reducing the number of files Druid has to ingest.
+
+`equals` Filter:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `equals`.|yes|
+|filterColumn|The column name from the iceberg table schema based on which filtering needs to happen|yes|
+|filterValue|The value to filter on|yes|
+
+`interval` Filter:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `interval`.|yes|
+|filterColumn|The column name from the iceberg table schema to filter on|yes|
+|intervals|A JSON array containing ISO-8601 interval strings. This defines the time ranges to filter on. The start interval is inclusive and the end interval is exclusive. |yes|
+
+`and` Filter:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `and`.|yes|
+|filters|List of iceberg filters that needs to be AND-ed|yes|

Review Comment:
   ```suggestion
   |`filters`|List of Iceberg filters to include.|Yes|
   ```
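   
   It might also be worth showing a composed filter in this section, since `and` and `or` take a list of nested filter objects. Something along these lines (just a sketch; the `region` column and its value are made up):
   
   ```json
   "icebergFilter": {
     "type": "and",
     "filters": [
       {
         "type": "interval",
         "filterColumn": "event_time",
         "intervals": ["2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"]
       },
       {
         "type": "equals",
         "filterColumn": "region",
         "filterValue": "us-west-2"
       }
     ]
   }
   ```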





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "abhishekagarwal87 (via GitHub)" <gi...@apache.org>.
abhishekagarwal87 merged PR #14329:
URL: https://github.com/apache/druid/pull/14329




[GitHub] [druid] cryptoe commented on a diff in pull request #14329: Extension to read and ingest iceberg data files

Posted by "cryptoe (via GitHub)" <gi...@apache.org>.
cryptoe commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1217950267


##########
docs/development/extensions-contrib/iceberg.md:
##########
@@ -0,0 +1,120 @@
+---
+id: iceberg 
+title: "Iceberg"
+---
+
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+
+## Iceberg Ingest Extension
+
+This extension provides [IcebergInputSource](../../ingestion/input-sources.md#iceberg-input-source) which enables ingestion of data stored in the Iceberg table format into Druid.
+
+Apache Iceberg is an open table format for huge analytic datasets. Even though iceberg manages most of its metadata on metadata files, it is still dependent on a metastore for managing a certain amount of metadata.
+These metastores are defined as Iceberg catalogs and this extension supports connecting to the following catalog types:
+* Hive metastore catalog
+* Local catalog
+
+Support for AWS Glue and REST based catalogs are not available yet.
+
+For a given catalog, iceberg table name and filters, the IcebergInputSource works by reading the table from the catalog, applying the filters and extracting all the underlying live data files up to the latest snapshot.
+The data files are in either Parquet, ORC or Avro formats and all of these have InputFormat support in Druid. The data files typically reside in a warehouse location which could be in HDFS, S3 or the local filesystem.
+This extension relies on the existing InputSource connectors in Druid to read the data files from the warehouse. Therefore, the IcebergInputSource can be considered as an intermediate InputSource which provides the file paths for other InputSource implementations.
+
+### Load the Iceberg Ingest extension
+
+To use the iceberg extension, add the `druid-iceberg-extensions` to the list of loaded extensions. See [Loading extensions](../../configuration/extensions.md#loading-extensions) for more information.
+
+
+### Hive Metastore catalog
+
+For Druid to seamlessly talk to the Hive Metastore, ensure that the Hive specific configuration files such as `hive-site.xml` and `core-site.xml` are available in the Druid classpath.  

Review Comment:
   Where are they needed? I am assuming they are only needed on the peons?



##########
docs/development/extensions-contrib/iceberg.md:
##########
@@ -0,0 +1,120 @@
+---
+id: iceberg 
+title: "Iceberg"
+---
+
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+
+## Iceberg Ingest Extension
+
+This extension provides [IcebergInputSource](../../ingestion/input-sources.md#iceberg-input-source) which enables ingestion of data stored in the Iceberg table format into Druid.
+
+Apache Iceberg is an open table format for huge analytic datasets. Even though iceberg manages most of its metadata on metadata files, it is still dependent on a metastore for managing a certain amount of metadata.

Review Comment:
   This might need rephrasing. Did you mean a metadata store here?



##########
extensions-contrib/druid-iceberg-extensions/pom.xml:
##########
@@ -0,0 +1,314 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+<project xmlns="http://maven.apache.org/POM/4.0.0"
+         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
+         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
+
+  <groupId>org.apache.druid.extensions</groupId>
+  <artifactId>druid-iceberg-extensions</artifactId>
+  <name>druid-iceberg-extensions</name>
+  <description>druid-iceberg-extensions</description>
+
+
+  <parent>
+    <artifactId>druid</artifactId>
+    <groupId>org.apache.druid</groupId>
+    <version>27.0.0-SNAPSHOT</version>
+    <relativePath>../../pom.xml</relativePath>
+  </parent>
+  <modelVersion>4.0.0</modelVersion>
+
+  <properties>

Review Comment:
   Nit: do we require an empty block here?



##########
processing/src/main/java/org/apache/druid/data/input/AbstractInputSourceAdapter.java:
##########
@@ -0,0 +1,130 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.data.input;
+
+import com.fasterxml.jackson.annotation.JsonSubTypes;
+import com.fasterxml.jackson.annotation.JsonTypeInfo;
+import org.apache.druid.data.input.impl.LocalInputSourceAdapter;
+import org.apache.druid.data.input.impl.SplittableInputSource;
+import org.apache.druid.java.util.common.CloseableIterators;
+import org.apache.druid.java.util.common.ISE;
+import org.apache.druid.java.util.common.parsers.CloseableIterator;
+
+import javax.annotation.Nullable;
+import java.io.File;
+import java.io.IOException;
+import java.util.Collections;
+import java.util.List;
+import java.util.stream.Stream;
+
+/**
+ * A wrapper on top of {@link SplittableInputSource} that handles input source creation.

Review Comment:
   I did not understand the intent of this class. More details in the class level docs would be helpful. 
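   
   Something along these lines might make the intent clearer (just a sketch of wording, based on how the input source is described elsewhere in this PR):
   
   ```java
   /**
    * Wraps a native {@link SplittableInputSource} (for example the local or S3 input source) so that
    * it can be created lazily, after the list of file paths to read is known. IcebergInputSource
    * resolves the data file paths from the catalog and hands them to this adapter, which then
    * builds the delegate input source that performs the actual read.
    */
   ```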



##########
extensions-contrib/druid-iceberg-extensions/src/main/java/org/apache/druid/iceberg/input/HiveIcebergCatalog.java:
##########
@@ -0,0 +1,136 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.iceberg.input;
+
+import com.fasterxml.jackson.annotation.JacksonInject;
+import com.fasterxml.jackson.annotation.JsonCreator;
+import com.fasterxml.jackson.annotation.JsonProperty;
+import com.google.common.base.Preconditions;
+import com.google.common.base.Strings;
+import org.apache.druid.iceberg.guice.HiveConf;
+import org.apache.druid.java.util.common.ISE;
+import org.apache.druid.java.util.common.logger.Logger;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.security.UserGroupInformation;
+import org.apache.iceberg.BaseMetastoreCatalog;
+import org.apache.iceberg.hive.HiveCatalog;
+
+import javax.annotation.Nullable;
+import java.io.IOException;
+import java.util.HashMap;
+import java.util.Map;
+
+/**
+ * Hive Metastore specific implementation of iceberg catalog.
+ * Kerberos authentication is performed if the credentials are provided in the catalog properties
+ */
+public class HiveIcebergCatalog extends IcebergCatalog
+{
+  public static final String TYPE_KEY = "hive";
+
+  @JsonProperty
+  private String warehousePath;
+
+  @JsonProperty
+  private String catalogUri;
+
+  @JsonProperty
+  private Map<String, String> catalogProperties;
+
+  private final Configuration configuration;
+
+  private BaseMetastoreCatalog hiveCatalog;
+
+  private static final Logger log = new Logger(HiveIcebergCatalog.class);
+
+  @JsonCreator
+  public HiveIcebergCatalog(
+      @JsonProperty("warehousePath") String warehousePath,
+      @JsonProperty("catalogUri") String catalogUri,
+      @JsonProperty("catalogProperties") @Nullable
+          Map<String, String> catalogProperties,
+      @JacksonInject @HiveConf Configuration configuration
+  )
+  {
+    this.warehousePath = Preconditions.checkNotNull(warehousePath, "warehousePath cannot be null");
+    this.catalogUri = Preconditions.checkNotNull(catalogUri, "catalogUri cannot be null");
+    this.catalogProperties = catalogProperties != null ? catalogProperties : new HashMap<>();
+    this.configuration = configuration;
+    this.catalogProperties
+        .forEach(this.configuration::set);
+    this.hiveCatalog = retrieveCatalog();
+  }
+
+  @Override
+  public BaseMetastoreCatalog retrieveCatalog()
+  {
+    if (hiveCatalog == null) {
+      hiveCatalog = setupCatalog();
+    }
+    return hiveCatalog;
+  }
+
+  private HiveCatalog setupCatalog()
+  {
+    HiveCatalog catalog = new HiveCatalog();
+    authenticate();

Review Comment:
   Do we need to handle remote HTTP/RPC-related exceptions here?



##########
extensions-contrib/druid-iceberg-extensions/src/main/java/org/apache/druid/iceberg/input/HiveIcebergCatalog.java:
##########
@@ -0,0 +1,136 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.iceberg.input;
+
+import com.fasterxml.jackson.annotation.JacksonInject;
+import com.fasterxml.jackson.annotation.JsonCreator;
+import com.fasterxml.jackson.annotation.JsonProperty;
+import com.google.common.base.Preconditions;
+import com.google.common.base.Strings;
+import org.apache.druid.iceberg.guice.HiveConf;
+import org.apache.druid.java.util.common.ISE;
+import org.apache.druid.java.util.common.logger.Logger;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.security.UserGroupInformation;
+import org.apache.iceberg.BaseMetastoreCatalog;
+import org.apache.iceberg.hive.HiveCatalog;
+
+import javax.annotation.Nullable;
+import java.io.IOException;
+import java.util.HashMap;
+import java.util.Map;
+
+/**
+ * Hive Metastore specific implementation of iceberg catalog.
+ * Kerberos authentication is performed if the credentials are provided in the catalog properties
+ */
+public class HiveIcebergCatalog extends IcebergCatalog
+{
+  public static final String TYPE_KEY = "hive";
+
+  @JsonProperty
+  private String warehousePath;
+
+  @JsonProperty
+  private String catalogUri;
+
+  @JsonProperty
+  private Map<String, String> catalogProperties;
+
+  private final Configuration configuration;
+
+  private BaseMetastoreCatalog hiveCatalog;
+
+  private static final Logger log = new Logger(HiveIcebergCatalog.class);
+
+  @JsonCreator
+  public HiveIcebergCatalog(
+      @JsonProperty("warehousePath") String warehousePath,
+      @JsonProperty("catalogUri") String catalogUri,
+      @JsonProperty("catalogProperties") @Nullable
+          Map<String, String> catalogProperties,
+      @JacksonInject @HiveConf Configuration configuration
+  )
+  {
+    this.warehousePath = Preconditions.checkNotNull(warehousePath, "warehousePath cannot be null");
+    this.catalogUri = Preconditions.checkNotNull(catalogUri, "catalogUri cannot be null");
+    this.catalogProperties = catalogProperties != null ? catalogProperties : new HashMap<>();
+    this.configuration = configuration;
+    this.catalogProperties
+        .forEach(this.configuration::set);
+    this.hiveCatalog = retrieveCatalog();
+  }
+
+  @Override
+  public BaseMetastoreCatalog retrieveCatalog()
+  {
+    if (hiveCatalog == null) {
+      hiveCatalog = setupCatalog();
+    }
+    return hiveCatalog;
+  }
+
+  private HiveCatalog setupCatalog()
+  {
+    HiveCatalog catalog = new HiveCatalog();
+    authenticate();
+    catalog.setConf(configuration);
+    catalogProperties.put("warehouse", warehousePath);
+    catalogProperties.put("uri", catalogUri);
+    catalog.initialize("hive", catalogProperties);
+    return catalog;
+  }
+
+  private void authenticate()
+  {
+    String principal = catalogProperties.getOrDefault("principal", null);

Review Comment:
   Are there other types of authentication methods, or do we only support krb5 in the initial version?
   In any case, we should document this explicitly.
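   
   For the docs, it is probably enough to state that only Kerberos (via the Hadoop UGI API) is supported for now. Presumably the `authenticate()` path boils down to something like this (sketch only; `principal` and `keytab` are the `catalogProperties` keys mentioned in the docs):
   
   ```java
   try {
     String principal = catalogProperties.get("principal");
     String keytab = catalogProperties.get("keytab");
     if (principal != null && keytab != null) {
       // Standard Hadoop Kerberos login via UserGroupInformation.
       UserGroupInformation.setConfiguration(configuration);
       UserGroupInformation.loginUserFromKeytab(principal, keytab);
     }
   }
   catch (IOException e) {
     throw new ISE(e, "Failed to authenticate against the Hive metastore");
   }
   ```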



##########
extensions-contrib/druid-iceberg-extensions/pom.xml:
##########
@@ -0,0 +1,314 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+<project xmlns="http://maven.apache.org/POM/4.0.0"
+         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
+         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
+
+  <groupId>org.apache.druid.extensions</groupId>
+  <artifactId>druid-iceberg-extensions</artifactId>
+  <name>druid-iceberg-extensions</name>
+  <description>druid-iceberg-extensions</description>
+
+
+  <parent>
+    <artifactId>druid</artifactId>
+    <groupId>org.apache.druid</groupId>
+    <version>27.0.0-SNAPSHOT</version>
+    <relativePath>../../pom.xml</relativePath>
+  </parent>
+  <modelVersion>4.0.0</modelVersion>
+
+  <properties>
+  </properties>
+  <dependencies>
+    <dependency>

Review Comment:
   I do not see hadoop 2/hadoop 3 profiles. For reference, you can have a look here: https://github.com/apache/druid/blob/master/extensions-core/hdfs-storage/pom.xml#L142



##########
extensions-contrib/druid-iceberg-extensions/src/main/java/org/apache/druid/iceberg/input/IcebergCatalog.java:
##########
@@ -0,0 +1,99 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.iceberg.input;
+
+import com.fasterxml.jackson.annotation.JsonTypeInfo;
+import org.apache.druid.data.input.InputFormat;
+import org.apache.druid.iceberg.filter.IcebergFilter;
+import org.apache.druid.java.util.common.IAE;
+import org.apache.druid.java.util.common.RE;
+import org.apache.druid.java.util.common.logger.Logger;
+import org.apache.iceberg.BaseMetastoreCatalog;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.TableScan;
+import org.apache.iceberg.catalog.Catalog;
+import org.apache.iceberg.catalog.Namespace;
+import org.apache.iceberg.catalog.TableIdentifier;
+import org.apache.iceberg.io.CloseableIterable;
+
+import java.util.ArrayList;
+import java.util.List;
+
+/*
+ * Druid wrapper for an iceberg catalog.
+ * The configured catalog is used to load the specified iceberg table and retrieve the underlying live data files up to the latest snapshot.
+ * This does not perform any projections on the table yet, therefore all the underlying columns will be retrieved from the data files.

Review Comment:
   Where is the `icebergFilter` expression filtering happening?
   Does the filtering happen while pruning the list of data files that need to be fetched?



##########
extensions-contrib/druid-iceberg-extensions/src/main/java/org/apache/druid/iceberg/input/IcebergCatalog.java:
##########
@@ -0,0 +1,99 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.iceberg.input;
+
+import com.fasterxml.jackson.annotation.JsonTypeInfo;
+import org.apache.druid.data.input.InputFormat;
+import org.apache.druid.iceberg.filter.IcebergFilter;
+import org.apache.druid.java.util.common.IAE;
+import org.apache.druid.java.util.common.RE;
+import org.apache.druid.java.util.common.logger.Logger;
+import org.apache.iceberg.BaseMetastoreCatalog;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.TableScan;
+import org.apache.iceberg.catalog.Catalog;
+import org.apache.iceberg.catalog.Namespace;
+import org.apache.iceberg.catalog.TableIdentifier;
+import org.apache.iceberg.io.CloseableIterable;
+
+import java.util.ArrayList;
+import java.util.List;
+
+/*
+ * Druid wrapper for an iceberg catalog.
+ * The configured catalog is used to load the specified iceberg table and retrieve the underlying live data files up to the latest snapshot.
+ * This does not perform any projections on the table yet, therefore all the underlying columns will be retrieved from the data files.
+ */
+
+@JsonTypeInfo(use = JsonTypeInfo.Id.NAME, property = InputFormat.TYPE_PROPERTY)
+public abstract class IcebergCatalog
+{
+  private static final Logger log = new Logger(IcebergCatalog.class);
+
+  public abstract BaseMetastoreCatalog retrieveCatalog();
+
+  /**
+   * Extract the iceberg data files up to the latest snapshot associated with the table
+   *
+   * @param tableNamespace The catalog namespace under which the table is defined
+   * @param tableName      The iceberg table name
+   * @return a list of data file paths
+   */
+  public List<String> extractSnapshotDataFiles(
+      String tableNamespace,
+      String tableName,
+      IcebergFilter icebergFilter
+  )
+  {
+    Catalog catalog = retrieveCatalog();
+    Namespace namespace = Namespace.of(tableNamespace);
+    String tableIdentifier = tableNamespace + "." + tableName;
+
+    Thread.currentThread().setContextClassLoader(getClass().getClassLoader());
+    TableIdentifier icebergTableIdentifier = catalog.listTables(namespace).stream()

Review Comment:
   I think this call needs special error handling to let the user know that there is a connectivity issue or that a bad configuration was passed.
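   
   Something like this would at least surface the likely causes (sketch only, using the `RE` helper that is already imported in this class):
   
   ```java
   List<TableIdentifier> tableIdentifiers;
   try {
     tableIdentifiers = catalog.listTables(namespace);
   }
   catch (Exception e) {
     throw new RE(
         e,
         "Failed to list tables in namespace [%s]. Check the catalog URI and catalog properties for connectivity or configuration problems.",
         tableNamespace
     );
   }
   ```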



##########
extensions-contrib/druid-iceberg-extensions/pom.xml:
##########
@@ -0,0 +1,314 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+<project xmlns="http://maven.apache.org/POM/4.0.0"
+         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
+         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
+
+  <groupId>org.apache.druid.extensions</groupId>
+  <artifactId>druid-iceberg-extensions</artifactId>
+  <name>druid-iceberg-extensions</name>
+  <description>druid-iceberg-extensions</description>
+
+
+  <parent>
+    <artifactId>druid</artifactId>
+    <groupId>org.apache.druid</groupId>
+    <version>27.0.0-SNAPSHOT</version>
+    <relativePath>../../pom.xml</relativePath>
+  </parent>
+  <modelVersion>4.0.0</modelVersion>
+
+  <properties>
+  </properties>
+  <dependencies>
+    <dependency>
+      <groupId>org.apache.hadoop</groupId>
+      <artifactId>hadoop-common</artifactId>
+      <version>${hadoop.compile.version}</version>
+      <exclusions>
+        <exclusion>
+          <groupId>io.netty</groupId>
+          <artifactId>netty-buffer</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>commons-cli</groupId>
+          <artifactId>commons-cli</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>log4j</groupId>
+          <artifactId>log4j</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>commons-codec</groupId>
+          <artifactId>commons-codec</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>commons-logging</groupId>
+          <artifactId>commons-logging</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>commons-io</groupId>
+          <artifactId>commons-io</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>commons-lang</groupId>
+          <artifactId>commons-lang</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>org.apache.httpcomponents</groupId>
+          <artifactId>httpclient</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>org.apache.httpcomponents</groupId>
+          <artifactId>httpcore</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>org.apache.zookeeper</groupId>
+          <artifactId>zookeeper</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>org.slf4j</groupId>
+          <artifactId>slf4j-api</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>org.slf4j</groupId>
+          <artifactId>slf4j-log4j12</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>javax.ws.rs</groupId>
+          <artifactId>jsr311-api</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>com.google.code.findbugs</groupId>
+          <artifactId>jsr305</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>org.mortbay.jetty</groupId>
+          <artifactId>jetty-util</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>com.google.protobuf</groupId>
+          <artifactId>protobuf-java</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>com.sun.jersey</groupId>
+          <artifactId>jersey-core</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>org.apache.curator</groupId>
+          <artifactId>curator-client</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>org.apache.commons</groupId>
+          <artifactId>commons-math3</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>com.google.guava</groupId>
+          <artifactId>guava</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>org.apache.avro</groupId>
+          <artifactId>avro</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>net.java.dev.jets3t</groupId>
+          <artifactId>jets3t</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>com.sun.jersey</groupId>
+          <artifactId>jersey-json</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>com.jcraft</groupId>
+          <artifactId>jsch</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>org.mortbay.jetty</groupId>
+          <artifactId>jetty</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>com.sun.jersey</groupId>
+          <artifactId>jersey-server</artifactId>
+        </exclusion>
+        <!-- Following are excluded to remove security vulnerabilities: -->
+        <exclusion>
+          <groupId>commons-beanutils</groupId>
+          <artifactId>commons-beanutils-core</artifactId>
+        </exclusion>
+      </exclusions>
+    </dependency>
+
+    <dependency>
+      <groupId>org.apache.iceberg</groupId>
+      <artifactId>iceberg-spark-runtime-3.3_2.12</artifactId>
+      <version>1.0.0</version>

Review Comment:
   All these versions should go in the main pom.



##########
extensions-contrib/druid-iceberg-extensions/src/main/java/org/apache/druid/iceberg/input/IcebergCatalog.java:
##########
@@ -0,0 +1,99 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.iceberg.input;
+
+import com.fasterxml.jackson.annotation.JsonTypeInfo;
+import org.apache.druid.data.input.InputFormat;
+import org.apache.druid.iceberg.filter.IcebergFilter;
+import org.apache.druid.java.util.common.IAE;
+import org.apache.druid.java.util.common.RE;
+import org.apache.druid.java.util.common.logger.Logger;
+import org.apache.iceberg.BaseMetastoreCatalog;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.TableScan;
+import org.apache.iceberg.catalog.Catalog;
+import org.apache.iceberg.catalog.Namespace;
+import org.apache.iceberg.catalog.TableIdentifier;
+import org.apache.iceberg.io.CloseableIterable;
+
+import java.util.ArrayList;
+import java.util.List;
+
+/*
+ * Druid wrapper for an iceberg catalog.
+ * The configured catalog is used to load the specified iceberg table and retrieve the underlying live data files upto the latest snapshot.
+ * This does not perform any projections on the table yet, therefore all the underlying columns will be retrieved from the data files.
+ */
+
+@JsonTypeInfo(use = JsonTypeInfo.Id.NAME, property = InputFormat.TYPE_PROPERTY)
+public abstract class IcebergCatalog
+{
+  private static final Logger log = new Logger(IcebergCatalog.class);
+
+  public abstract BaseMetastoreCatalog retrieveCatalog();
+
+  /**
+   * Extract the iceberg data files up to the latest snapshot associated with the table
+   *
+   * @param tableNamespace The catalog namespace under which the table is defined
+   * @param tableName      The iceberg table name
+   * @return a list of data file paths
+   */
+  public List<String> extractSnapshotDataFiles(
+      String tableNamespace,
+      String tableName,
+      IcebergFilter icebergFilter
+  )
+  {
+    Catalog catalog = retrieveCatalog();
+    Namespace namespace = Namespace.of(tableNamespace);
+    String tableIdentifier = tableNamespace + "." + tableName;
+
+    Thread.currentThread().setContextClassLoader(getClass().getClassLoader());
+    TableIdentifier icebergTableIdentifier = catalog.listTables(namespace).stream()
+                                                    .filter(tableId -> tableId.toString().equals(tableIdentifier))
+                                                    .findFirst()
+                                                    .orElseThrow(() -> new IAE(
+                                                        " Couldn't retrieve table identifier for '%s'",
+                                                        tableIdentifier
+                                                    ));
+
+    long start = System.currentTimeMillis();
+    List<String> dataFilePaths = new ArrayList<>();
+    try {
+      TableScan tableScan = catalog.loadTable(icebergTableIdentifier).newScan();
+
+      if (icebergFilter != null) {
+        tableScan = icebergFilter.filter(tableScan);
+      }
+
+      CloseableIterable<FileScanTask> tasks = tableScan.planFiles();
+      CloseableIterable.transform(tasks, FileScanTask::file)
+                       .forEach(dataFile -> dataFilePaths.add(dataFile.path().toString()));
+
+      long duration = System.currentTimeMillis() - start;
+      log.info("Data file scan and fetch took %d ms", duration);

Review Comment:
   You could also log the number of `dataFilePaths` here
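   
   For example:
   
   ```java
   // Include the number of scanned data files along with the timing.
   log.info("Data file scan and fetch took [%d] ms and returned [%d] data file paths", duration, dataFilePaths.size());
   ```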



##########
extensions-contrib/druid-iceberg-extensions/src/main/java/org/apache/druid/iceberg/input/IcebergInputSource.java:
##########
@@ -0,0 +1,169 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.iceberg.input;
+
+import com.fasterxml.jackson.annotation.JsonCreator;
+import com.fasterxml.jackson.annotation.JsonProperty;
+import com.google.common.base.Preconditions;
+import org.apache.druid.data.input.AbstractInputSourceAdapter;
+import org.apache.druid.data.input.InputFormat;
+import org.apache.druid.data.input.InputRowSchema;
+import org.apache.druid.data.input.InputSource;
+import org.apache.druid.data.input.InputSourceReader;
+import org.apache.druid.data.input.InputSplit;
+import org.apache.druid.data.input.SplitHintSpec;
+import org.apache.druid.data.input.impl.SplittableInputSource;
+import org.apache.druid.iceberg.filter.IcebergFilter;
+
+import javax.annotation.Nullable;
+import java.io.File;
+import java.io.IOException;
+import java.util.List;
+import java.util.stream.Stream;
+
+/**
+ * Inputsource to ingest data managed by the Iceberg table format.
+ * This inputsource talks to the configured catalog, executes any configured filters and retrieves the data file paths up to the latest snapshot associated with the iceberg table.
+ * The data file paths are then provided to a native {@link SplittableInputSource} implementation depending on the warehouse source defined.
+ */
+public class IcebergInputSource implements SplittableInputSource<List<String>>
+{
+  public static final String TYPE_KEY = "iceberg";
+
+  @JsonProperty
+  private final String tableName;
+
+  @JsonProperty
+  private final String namespace;
+
+  @JsonProperty
+  private IcebergCatalog icebergCatalog;
+
+  @JsonProperty
+  private IcebergFilter icebergFilter;
+
+  @JsonProperty
+  private AbstractInputSourceAdapter warehouseSource;
+
+  private boolean isLoaded = false;
+
+  @JsonCreator
+  public IcebergInputSource(
+      @JsonProperty("tableName") String tableName,
+      @JsonProperty("namespace") String namespace,
+      @JsonProperty("icebergFilter") @Nullable IcebergFilter icebergFilter,
+      @JsonProperty("icebergCatalog") IcebergCatalog icebergCatalog,
+      @JsonProperty("warehouseSource") AbstractInputSourceAdapter warehouseSource

Review Comment:
   Can this be another input source here?
   We can do some validations here to check that it is only a local or S3 input source.
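   
   At a minimum the constructor could fail fast on a missing adapter, for example (sketch only; a stricter local/S3/HDFS allow-list would need a way to read the adapter's type, which I'm not assuming here):
   
   ```java
   this.warehouseSource = Preconditions.checkNotNull(warehouseSource, "warehouseSource cannot be null");
   this.icebergCatalog = Preconditions.checkNotNull(icebergCatalog, "icebergCatalog cannot be null");
   ```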





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260488248


##########
docs/development/extensions-contrib/iceberg.md:
##########
@@ -0,0 +1,121 @@
+---
+id: iceberg 
+title: "Iceberg"
+---
+
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+
+## Iceberg Ingest Extension
+
+This extension provides [IcebergInputSource](../../ingestion/input-sources.md#iceberg-input-source) which enables ingestion of data stored in the Iceberg table format into Druid.
+
+Apache Iceberg is an open table format for huge analytic datasets. Even though iceberg manages most of its metadata in metadata files in the object storage, it is still dependent on a metastore for managing a certain amount of metadata.
+These metastores are defined as Iceberg catalogs and this extension supports connecting to the following catalog types:
+* Hive metastore catalog
+* Local catalog
+
+Support for AWS Glue and REST based catalogs are not available yet.
+
+For a given catalog, iceberg table name and filters, the IcebergInputSource works by reading the table from the catalog, applying the filters and extracting all the underlying live data files up to the latest snapshot.
+The data files are in either Parquet, ORC or Avro formats and all of these have InputFormat support in Druid. The data files typically reside in a warehouse location which could be in HDFS, S3 or the local filesystem.
+This extension relies on the existing InputSource connectors in Druid to read the data files from the warehouse. Therefore, the IcebergInputSource can be considered as an intermediate InputSource which provides the file paths for other InputSource implementations.
+
+### Load the Iceberg Ingest extension
+
+To use the iceberg extension, add the `druid-iceberg-extensions` to the list of loaded extensions. See [Loading extensions](../../configuration/extensions.md#loading-extensions) for more information.
+
+
+### Hive Metastore catalog
+
+For Druid to seamlessly talk to the Hive Metastore, ensure that the Hive specific configuration files such as `hive-site.xml` and `core-site.xml` are available in the Druid classpath for peon processes.  
+Hive specific properties can also be specified under the `catalogProperties` object in the ingestion spec. 
+
+Hive metastore catalogs can be associated with different types of warehouses, but this extension presently only supports HDFS and S3 warehouse directories.
+
+#### Reading from HDFS warehouse 
+
+Ensure that the extension `druid-hdfs-storage` is loaded. The data file paths are extracted from the Hive metastore catalog and [HDFS input source](../../ingestion/input-sources.md#hdfs-input-source) is used to ingest these files.
+The `warehouseSource` type in the ingestion spec should be `hdfs`.
+
+If the Hive metastore supports Kerberos authentication, the following properties will be required in the `catalogProperties`:
+
+```json
+"catalogProperties": {
+  "principal": "krb_principal",
+  "keytab": "/path/to/keytab"
+}
+```
+Only Kerberos based authentication is supported as of now.
+
+#### Reading from S3 warehouse
+
+Ensure that the extension `druid-s3-extensions` is loaded. The data file paths are extracted from the Hive metastore catalog and `S3InputSource` is used to ingest these files.
+The `warehouseSource` type in the ingestion spec should be `s3`. If the S3 endpoint for the warehouse is different from the endpoint configured as the deep storage, the following properties are required in the `warehouseSource` section to define the S3 endpoint settings:
+
+```json
+"warehouseSource": {
+  "type": "s3",
+  "endpointConfig": {
+    "url": "S3_ENDPOINT_URL",
+    "signingRegion": "us-east-1"
+  },
+  "clientConfig": {
+    "protocol": "http",
+    "disableChunkedEncoding": true,
+    "enablePathStyleAccess": true,
+    "forceGlobalBucketAccessEnabled": false
+  },
+  "properties": {
+    "accessKeyId": {
+      "type": "default",
+      "password": "<ACCESS_KEY_ID"
+    },
+    "secretAccessKey": {
+      "type": "default",
+      "password": "<SECRET_ACCESS_KEY>"
+    }
+  }
+}
+```
+
+This extension uses the [Hadoop AWS module](https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/) to connect to S3 and retrieve the metadata and data file paths.
+The following properties will be required in the `catalogProperties`:
+
+```json
+"catalogProperties": {
+  "fs.s3a.access.key" : "S3_ACCESS_KEY",
+  "fs.s3a.secret.key" : "S3_SECRET_KEY",
+  "fs.s3a.endpoint" : "S3_API_ENDPOINT"
+}
+```
+Since the Hadoop AWS connector uses the `s3a` filesystem based client, the warehouse path should be specified with the `s3a://` protocol instead of `s3://`.
+
+### Local Catalog
+
+The local catalog type can be used for catalogs configured on the local filesystem. The `icebergCatalog` type should be set as `local`. This catalog is useful for demos or localized tests and is not recommended for production use cases.

Review Comment:
   ```suggestion
   The local catalog type can be used for catalogs configured on the local filesystem. Set the `icebergCatalog` type to `local`. You can use this catalog for demos or localized tests. It is not recommended for production use cases.
   ```





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260342485


##########
docs/development/extensions-contrib/iceberg.md:
##########
@@ -0,0 +1,121 @@
+---
+id: iceberg 
+title: "Iceberg"
+---
+
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+
+## Iceberg Ingest Extension

Review Comment:
   I don't think this heading is necessary. If you delete this heading, you can move the other headings up a level.
   
   Since the topic is about Iceberg ingestion, consider introducing the feature first and then talk about the extension as a means of enabling the feature. For example:
   
   Apache Iceberg is an open table format for huge analytic datasets. [Iceberg input source](../../ingestion/input-sources.md#iceberg-input-source) lets you ingest data stored in the Iceberg table format into Apache Druid. To enable the Iceberg input source, add the `druid-iceberg-extensions` extension to the list of extensions. See [Loading extensions](../../configuration/extensions.md#loading-extensions) for more information.
   
   Iceberg manages most of its metadata in metadata files in the object storage. In some cases, it uses a metastore to manage a certain amount of metadata.





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260536069


##########
docs/ingestion/input-sources.md:
##########
@@ -794,6 +794,194 @@ The following is an example of a Combining input source spec:
 ...
 ```
 
+## Iceberg input source
+
+> You need to load the `druid-iceberg-extensions` extension to use the Iceberg input source.
+
+The Iceberg input source is used to read data stored in the Iceberg table format. For a given table, this input source scans up to the latest Iceberg snapshot from the configured Hive catalog, and the underlying live data files are ingested using the existing input source formats available in Druid.
+
+The Iceberg input source cannot be used on its own; it relies on the existing input sources to perform the actual read from the data files.
+For example, if the warehouse associated with an Iceberg catalog is on `S3`, ensure that the [`druid-s3-extensions`](../development/extensions-core/s3.md) extension is also loaded.
+
+Sample specs:
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "iceberg",
+        "tableName": "iceberg_table",
+        "namespace": "iceberg_namespace",
+        "icebergCatalog": {
+            "type": "hive",
+            "warehousePath": "hdfs://warehouse/path",
+            "catalogUri": "thrift://hive-metastore.x.com:8970",
+            "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "catalog_test",
+                "hadoop.security.authentication": "kerberos",
+                "hadoop.security.authorization": "true"
+            }
+        },
+        "icebergFilter": {
+            "type": "interval",
+            "filterColumn": "event_time",
+            "intervals": [
+              "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+            ]
+        },
+        "warehouseSource": {
+            "type": "hdfs"
+        }
+      },
+      "inputFormat": {
+        "type": "parquet"
+      }
+  },
+      ...
+},
+...
+```
+
+```json
+...
+        "ioConfig": {
+          "type": "index_parallel",
+          "inputSource": {
+            "type": "iceberg",
+            "tableName": "iceberg_table",
+            "namespace": "iceberg_namespace",
+            "icebergCatalog": {
+              "type": "hive",
+              "warehousePath": "hdfs://warehouse/path",
+              "catalogUri": "thrift://hive-metastore.x.com:8970",
+              "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "default_catalog",
+                "fs.s3a.access.key" : "S3_ACCESS_KEY",
+                "fs.s3a.secret.key" : "S3_SECRET_KEY",
+                "fs.s3a.endpoint" : "S3_API_ENDPOINT"
+              }
+            },
+            "icebergFilter": {
+              "type": "interval",
+              "filterColumn": "event_time",
+              "intervals": [
+                "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+              ]
+            },
+            "warehouseSource": {
+              "type": "s3",
+              "endpointConfig": {
+                "url": "teststore.aws.com",
+                "signingRegion": "us-west-2a"
+              },
+              "clientConfig": {
+                "protocol": "http",
+                "disableChunkedEncoding": true,
+                "enablePathStyleAccess": true,
+                "forceGlobalBucketAccessEnabled": false
+              },
+              "properties": {
+                "accessKeyId": {
+                  "type": "default",
+                  "password": "foo"
+                },
+                "secretAccessKey": {
+                  "type": "default",
+                  "password": "bar"
+                }
+              },
+            }
+          },
+          "inputFormat": {
+            "type": "parquet"
+          }
+        },
+...
+},
+```
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set the value to `iceberg`.|yes|
+|tableName|The iceberg table name configured in the catalog.|yes|
+|namespace|The iceberg namespace associated with the table|yes|
+|icebergFilter|JSON Object used to filter data files within a snapshot when reading|no|
+|icebergCatalog|JSON Object used to define the catalog that manages the configured iceberg table|yes|
+|warehouseSource|JSON Object used to indicate which native input source needs to be used to read the data files from the warehouse|yes|
+
+Catalog Object:
+
+There are two supported catalog types: `local` and `hive`
+
+Local catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `local`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogProperties|Map of any additional properties that need to be attached to the catalog|no|
+
+Hive Catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `hive`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogUri|The URI associated with the hive catalog|yes|
+|catalogProperties|Map of any additional properties that need to be attached to the catalog|no|
+
+IcebergFilter Object:
+
+This input source provides the following filters: `and`, `equals`, `interval`, and `or`. These filters can be used to filter out data files from a snapshot, thereby reducing the number of files Druid has to ingest.
+
+`equals` Filter:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `equals`.|yes|
+|filterColumn|The column name from the iceberg table schema to filter on|yes|
+|filterValue|The value to filter on|yes|
+
+`interval` Filter:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `interval`.|yes|
+|filterColumn|The column name from the iceberg table schema to filter on|yes|
+|intervals|A JSON array containing ISO-8601 interval strings. This defines the time ranges to filter on. The start interval is inclusive and the end interval is exclusive. |yes|
+
+`and` Filter:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `and`.|yes|

Review Comment:
   ```suggestion
   |`type`|Set this value to `and`.|Yes|
   ```





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260535057


##########
docs/ingestion/input-sources.md:
##########
@@ -794,6 +794,194 @@ The following is an example of a Combining input source spec:
 ...
 ```
 
+## Iceberg input source
+
+> You need to load the `druid-iceberg-extensions` extension to use the Iceberg input source.
+
+The Iceberg input source is used to read data stored in the Iceberg table format. For a given table, this input source scans up to the latest Iceberg snapshot from the configured Hive catalog, and the underlying live data files are ingested using the existing input source formats available in Druid.
+
+The Iceberg input source cannot be used on its own; it relies on the existing input sources to perform the actual read from the data files.
+For example, if the warehouse associated with an Iceberg catalog is on `S3`, ensure that the [`druid-s3-extensions`](../development/extensions-core/s3.md) extension is also loaded.
+
+Sample specs:
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "iceberg",
+        "tableName": "iceberg_table",
+        "namespace": "iceberg_namespace",
+        "icebergCatalog": {
+            "type": "hive",
+            "warehousePath": "hdfs://warehouse/path",
+            "catalogUri": "thrift://hive-metastore.x.com:8970",
+            "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "catalog_test",
+                "hadoop.security.authentication": "kerberos",
+                "hadoop.security.authorization": "true"
+            }
+        },
+        "icebergFilter": {
+            "type": "interval",
+            "filterColumn": "event_time",
+            "intervals": [
+              "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+            ]
+        },
+        "warehouseSource": {
+            "type": "hdfs"
+        }
+      },
+      "inputFormat": {
+        "type": "parquet"
+      }
+  },
+      ...
+},
+...
+```
+
+```json
+...
+        "ioConfig": {
+          "type": "index_parallel",
+          "inputSource": {
+            "type": "iceberg",
+            "tableName": "iceberg_table",
+            "namespace": "iceberg_namespace",
+            "icebergCatalog": {
+              "type": "hive",
+              "warehousePath": "hdfs://warehouse/path",
+              "catalogUri": "thrift://hive-metastore.x.com:8970",
+              "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "default_catalog",
+                "fs.s3a.access.key" : "S3_ACCESS_KEY",
+                "fs.s3a.secret.key" : "S3_SECRET_KEY",
+                "fs.s3a.endpoint" : "S3_API_ENDPOINT"
+              }
+            },
+            "icebergFilter": {
+              "type": "interval",
+              "filterColumn": "event_time",
+              "intervals": [
+                "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+              ]
+            },
+            "warehouseSource": {
+              "type": "s3",
+              "endpointConfig": {
+                "url": "teststore.aws.com",
+                "signingRegion": "us-west-2a"
+              },
+              "clientConfig": {
+                "protocol": "http",
+                "disableChunkedEncoding": true,
+                "enablePathStyleAccess": true,
+                "forceGlobalBucketAccessEnabled": false
+              },
+              "properties": {
+                "accessKeyId": {
+                  "type": "default",
+                  "password": "foo"
+                },
+                "secretAccessKey": {
+                  "type": "default",
+                  "password": "bar"
+                }
+              },
+            }
+          },
+          "inputFormat": {
+            "type": "parquet"
+          }
+        },
+...
+},
+```
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set the value to `iceberg`.|yes|
+|tableName|The iceberg table name configured in the catalog.|yes|
+|namespace|The iceberg namespace associated with the table|yes|
+|icebergFilter|JSON Object used to filter data files within a snapshot when reading|no|
+|icebergCatalog|JSON Object used to define the catalog that manages the configured iceberg table|yes|
+|warehouseSource|JSON Object used to indicate which native input source needs to be used to read the data files from the warehouse|yes|
+
+Catalog Object:
+
+There are two supported catalog types: `local` and `hive`
+
+Local catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `local`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+Hive Catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `hive`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogUri|The URI associated with the hive catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+IcebergFilter Object:
+
+This input source provides filters: `and` , `equals` , `interval` and `or`. These filters can be used to filter out data files from a snapshot, thereby reducing the number of files Druid has to ingest.
+
+`equals` Filter:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `equals`.|yes|
+|filterColumn|The column name from the iceberg table schema based on which filtering needs to happen|yes|
+|filterValue|The value to filter on|yes|
+
+`interval` Filter:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `interval`.|yes|
+|filterColumn|The column name from the iceberg table schema based on which filtering needs to happen|yes|

Review Comment:
   ```suggestion
   |`filterColumn`|The name of the column from the Iceberg table schema to use for filtering.|Yes|
   ```
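
Neither sample spec in this hunk exercises the `equals` filter; a minimal sketch built only from the property table quoted above, with an illustrative column name and value:

```json
"icebergFilter": {
  "type": "equals",
  "filterColumn": "country",
  "filterValue": "US"
}
```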





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260534646


##########
docs/ingestion/input-sources.md:
##########
@@ -794,6 +794,194 @@ The following is an example of a Combining input source spec:
 ...
 ```
 
+## Iceberg input source
+
+> You need to include the `druid-iceberg-extensions` as an extension to use the Iceberg input source.
+
+The Iceberg input source is used to read data stored in the Iceberg table format. For a given table, this input source scans up to the latest iceberg snapshot from the configured Hive catalog and the underlying live data files will be ingested using the existing input source formats available in Druid.
+
+The Iceberg input source cannot be independent as it relies on the existing input sources to perform the actual read from the Data files.
+For example, if the warehouse associated with an iceberg catalog is on `S3`, please ensure that the [`druid-s3-extensions`](../development/extensions-core/s3.md) extension is also loaded.
+
+Sample specs:
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "iceberg",
+        "tableName": "iceberg_table",
+        "namespace": "iceberg_namespace",
+        "icebergCatalog": {
+            "type": "hive",
+            "warehousePath": "hdfs://warehouse/path",
+            "catalogUri": "thrift://hive-metastore.x.com:8970",
+            "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "catalog_test",
+                "hadoop.security.authentication": "kerberos",
+                "hadoop.security.authorization": "true"
+            }
+        },
+        "icebergFilter": {
+            "type": "interval",
+            "filterColumn": "event_time",
+            "intervals": [
+              "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+            ]
+        },
+        "warehouseSource": {
+            "type": "hdfs"
+        }
+      },
+      "inputFormat": {
+        "type": "parquet"
+      }
+  },
+      ...
+},
+...
+```
+
+```json
+...
+        "ioConfig": {
+          "type": "index_parallel",
+          "inputSource": {
+            "type": "iceberg",
+            "tableName": "iceberg_table",
+            "namespace": "iceberg_namespace",
+            "icebergCatalog": {
+              "type": "hive",
+              "warehousePath": "hdfs://warehouse/path",
+              "catalogUri": "thrift://hive-metastore.x.com:8970",
+              "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "default_catalog",
+                "fs.s3a.access.key" : "S3_ACCESS_KEY",
+                "fs.s3a.secret.key" : "S3_SECRET_KEY",
+                "fs.s3a.endpoint" : "S3_API_ENDPOINT"
+              }
+            },
+            "icebergFilter": {
+              "type": "interval",
+              "filterColumn": "event_time",
+              "intervals": [
+                "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+              ]
+            },
+            "warehouseSource": {
+              "type": "s3",
+              "endpointConfig": {
+                "url": "teststore.aws.com",
+                "signingRegion": "us-west-2a"
+              },
+              "clientConfig": {
+                "protocol": "http",
+                "disableChunkedEncoding": true,
+                "enablePathStyleAccess": true,
+                "forceGlobalBucketAccessEnabled": false
+              },
+              "properties": {
+                "accessKeyId": {
+                  "type": "default",
+                  "password": "foo"
+                },
+                "secretAccessKey": {
+                  "type": "default",
+                  "password": "bar"
+                }
+              },
+            }
+          },
+          "inputFormat": {
+            "type": "parquet"
+          }
+        },
+...
+},
+```
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set the value to `iceberg`.|yes|
+|tableName|The iceberg table name configured in the catalog.|yes|
+|namespace|The iceberg namespace associated with the table|yes|
+|icebergFilter|JSON Object used to filter data files within a snapshot when reading|no|
+|icebergCatalog|JSON Object used to define the catalog that manages the configured iceberg table|yes|
+|warehouseSource|JSON Object used to indicate which native input source needs to be used to read the data files from the warehouse|yes|
+
+Catalog Object:
+
+There are two supported catalog types: `local` and `hive`
+
+Local catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `local`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+Hive Catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `hive`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogUri|The URI associated with the hive catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+IcebergFilter Object:
+
+This input source provides filters: `and` , `equals` , `interval` and `or`. These filters can be used to filter out data files from a snapshot, thereby reducing the number of files Druid has to ingest.
+
+`equals` Filter:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `equals`.|yes|
+|filterColumn|The column name from the iceberg table schema based on which filtering needs to happen|yes|
+|filterValue|The value to filter on|yes|
+
+`interval` Filter:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `interval`.|yes|

Review Comment:
   ```suggestion
   |`type`|Set this value to `interval`.|Yes|
   ```
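
The sample specs pass a single interval, but because `intervals` is a JSON array, one filter can cover several disjoint time ranges; a small sketch with illustrative timestamps:

```json
"icebergFilter": {
  "type": "interval",
  "filterColumn": "event_time",
  "intervals": [
    "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z",
    "2023-05-12T19:00:00.000Z/2023-05-12T20:00:00.000Z"
  ]
}
```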





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "abhishekagarwal87 (via GitHub)" <gi...@apache.org>.
abhishekagarwal87 commented on PR #14329:
URL: https://github.com/apache/druid/pull/14329#issuecomment-1636803918

   @a2l007 - PR looks good to me. I will let you merge it. 




Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260358113


##########
docs/development/extensions-contrib/iceberg.md:
##########
@@ -0,0 +1,121 @@
+---
+id: iceberg 
+title: "Iceberg"
+---
+
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+
+## Iceberg Ingest Extension
+
+This extension provides [IcebergInputSource](../../ingestion/input-sources.md#iceberg-input-source) which enables ingestion of data stored in the Iceberg table format into Druid.
+
+Apache Iceberg is an open table format for huge analytic datasets. Even though iceberg manages most of its metadata in metadata files in the object storage, it is still dependent on a metastore for managing a certain amount of metadata.
+These metastores are defined as Iceberg catalogs and this extension supports connecting to the following catalog types:
+* Hive metastore catalog
+* Local catalog
+
+Support for AWS Glue and REST based catalogs are not available yet.
+
+For a given catalog, iceberg table name and filters, the IcebergInputSource works by reading the table from the catalog, applying the filters and extracting all the underlying live data files up to the latest snapshot.
+The data files are in either Parquet, ORC or Avro formats and all of these have InputFormat support in Druid. The data files typically reside in a warehouse location which could be in HDFS, S3 or the local filesystem.
+This extension relies on the existing InputSource connectors in Druid to read the data files from the warehouse. Therefore, the IcebergInputSource can be considered as an intermediate InputSource which provides the file paths for other InputSource implementations.
+
+### Load the Iceberg Ingest extension
+
+To use the iceberg extension, add the `druid-iceberg-extensions` to the list of loaded extensions. See [Loading extensions](../../configuration/extensions.md#loading-extensions) for more information.
+
+
+### Hive Metastore catalog
+
+For Druid to seamlessly talk to the Hive Metastore, ensure that the Hive specific configuration files such as `hive-site.xml` and `core-site.xml` are available in the Druid classpath for peon processes.  
+Hive specific properties can also be specified under the `catalogProperties` object in the ingestion spec. 
+
+Hive metastore catalogs can be associated with different types of warehouses, but this extension presently only supports HDFS and S3 warehouse directories.
+
+#### Reading from HDFS warehouse 
+
+Ensure that the extension `druid-hdfs-storage` is loaded. The data file paths are extracted from the Hive metastore catalog and [HDFS input source](../../ingestion/input-sources.md#hdfs-input-source) is used to ingest these files.

Review Comment:
   ```suggestion
   To read from an HDFS warehouse, load the `druid-hdfs-storage` extension. Druid extracts data file paths from the Hive metastore catalog and uses the [HDFS input source](../../ingestion/input-sources.md#hdfs-input-source) to ingest these files.
   ```
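
Both extensions named here are enabled through the common extension load list; one way to do this, sketched for `common.runtime.properties` (the parquet extension is included only because the sample specs use a Parquet input format):

```
druid.extensions.loadList=["druid-hdfs-storage", "druid-iceberg-extensions", "druid-parquet-extensions"]
```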





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260348931


##########
docs/development/extensions-contrib/iceberg.md:
##########
@@ -0,0 +1,121 @@
+---
+id: iceberg 
+title: "Iceberg"
+---
+
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+
+## Iceberg Ingest Extension
+
+This extension provides [IcebergInputSource](../../ingestion/input-sources.md#iceberg-input-source) which enables ingestion of data stored in the Iceberg table format into Druid.
+
+Apache Iceberg is an open table format for huge analytic datasets. Even though iceberg manages most of its metadata in metadata files in the object storage, it is still dependent on a metastore for managing a certain amount of metadata.
+These metastores are defined as Iceberg catalogs and this extension supports connecting to the following catalog types:
+* Hive metastore catalog
+* Local catalog
+
+Support for AWS Glue and REST based catalogs are not available yet.
+
+For a given catalog, iceberg table name and filters, the IcebergInputSource works by reading the table from the catalog, applying the filters and extracting all the underlying live data files up to the latest snapshot.
+The data files are in either Parquet, ORC or Avro formats and all of these have InputFormat support in Druid. The data files typically reside in a warehouse location which could be in HDFS, S3 or the local filesystem.

Review Comment:
   ```suggestion
   The data files can be in Parquet, ORC, or Avro formats. The data files typically reside in a warehouse location, which can be in HDFS, S3, or the local filesystem.
   ```
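
The sample specs in this thread pair the input source with a Parquet input format; for ORC-backed tables only the `inputFormat` block would change, as in this sketch that assumes Druid's standard ORC input format (which needs `druid-orc-extensions` loaded):

```json
"inputFormat": {
  "type": "orc"
}
```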





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260502429


##########
docs/ingestion/input-sources.md:
##########
@@ -794,6 +794,194 @@ The following is an example of a Combining input source spec:
 ...
 ```
 
+## Iceberg input source
+
+> You need to include the `druid-iceberg-extensions` as an extension to use the Iceberg input source.
+
+The Iceberg input source is used to read data stored in the Iceberg table format. For a given table, this input source scans up to the latest iceberg snapshot from the configured Hive catalog and the underlying live data files will be ingested using the existing input source formats available in Druid.
+
+The Iceberg input source cannot be independent as it relies on the existing input sources to perform the actual read from the Data files.
+For example, if the warehouse associated with an iceberg catalog is on `S3`, please ensure that the [`druid-s3-extensions`](../development/extensions-core/s3.md) extension is also loaded.

Review Comment:
   ```suggestion
   For example, if the warehouse associated with an Iceberg catalog is on S3, you must also load the [`druid-s3-extensions`](../development/extensions-core/s3.md) extension.
   ```





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260501397


##########
docs/ingestion/input-sources.md:
##########
@@ -794,6 +794,194 @@ The following is an example of a Combining input source spec:
 ...
 ```
 
+## Iceberg input source
+
+> You need to include the `druid-iceberg-extensions` as an extension to use the Iceberg input source.
+
+The Iceberg input source is used to read data stored in the Iceberg table format. For a given table, this input source scans up to the latest iceberg snapshot from the configured Hive catalog and the underlying live data files will be ingested using the existing input source formats available in Druid.
+
+The Iceberg input source cannot be independent as it relies on the existing input sources to perform the actual read from the Data files.

Review Comment:
   ```suggestion
   The Iceberg input source cannot be independent as it relies on the existing input sources to read from the data files.
   ```





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260514925


##########
docs/ingestion/input-sources.md:
##########
@@ -794,6 +794,194 @@ The following is an example of a Combining input source spec:
 ...
 ```
 
+## Iceberg input source
+
+> You need to include the `druid-iceberg-extensions` as an extension to use the Iceberg input source.
+
+The Iceberg input source is used to read data stored in the Iceberg table format. For a given table, this input source scans up to the latest iceberg snapshot from the configured Hive catalog and the underlying live data files will be ingested using the existing input source formats available in Druid.
+
+The Iceberg input source cannot be independent as it relies on the existing input sources to perform the actual read from the Data files.
+For example, if the warehouse associated with an iceberg catalog is on `S3`, please ensure that the [`druid-s3-extensions`](../development/extensions-core/s3.md) extension is also loaded.
+
+Sample specs:
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "iceberg",
+        "tableName": "iceberg_table",
+        "namespace": "iceberg_namespace",
+        "icebergCatalog": {
+            "type": "hive",
+            "warehousePath": "hdfs://warehouse/path",
+            "catalogUri": "thrift://hive-metastore.x.com:8970",
+            "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "catalog_test",
+                "hadoop.security.authentication": "kerberos",
+                "hadoop.security.authorization": "true"
+            }
+        },
+        "icebergFilter": {
+            "type": "interval",
+            "filterColumn": "event_time",
+            "intervals": [
+              "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+            ]
+        },
+        "warehouseSource": {
+            "type": "hdfs"
+        }
+      },
+      "inputFormat": {
+        "type": "parquet"
+      }
+  },
+      ...
+},
+...
+```
+

Review Comment:
   ```suggestion
   
   The following is a sample spec for an S3 warehouse source:
   ```



##########
docs/ingestion/input-sources.md:
##########
@@ -794,6 +794,194 @@ The following is an example of a Combining input source spec:
 ...
 ```
 
+## Iceberg input source
+
+> You need to include the `druid-iceberg-extensions` as an extension to use the Iceberg input source.
+
+The Iceberg input source is used to read data stored in the Iceberg table format. For a given table, this input source scans up to the latest iceberg snapshot from the configured Hive catalog and the underlying live data files will be ingested using the existing input source formats available in Druid.
+
+The Iceberg input source cannot be independent as it relies on the existing input sources to perform the actual read from the Data files.
+For example, if the warehouse associated with an iceberg catalog is on `S3`, please ensure that the [`druid-s3-extensions`](../development/extensions-core/s3.md) extension is also loaded.
+
+Sample specs:

Review Comment:
   ```suggestion
   The following is a sample spec for an HDFS warehouse source:
   ```





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260502738


##########
docs/ingestion/input-sources.md:
##########
@@ -794,6 +794,194 @@ The following is an example of a Combining input source spec:
 ...
 ```
 
+## Iceberg input source
+
+> You need to include the `druid-iceberg-extensions` as an extension to use the Iceberg input source.
+
+The Iceberg input source is used to read data stored in the Iceberg table format. For a given table, this input source scans up to the latest iceberg snapshot from the configured Hive catalog and the underlying live data files will be ingested using the existing input source formats available in Druid.
+
+The Iceberg input source cannot be independent as it relies on the existing input sources to perform the actual read from the Data files.
+For example, if the warehouse associated with an iceberg catalog is on `S3`, please ensure that the [`druid-s3-extensions`](../development/extensions-core/s3.md) extension is also loaded.
+
+Sample specs:
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "iceberg",
+        "tableName": "iceberg_table",
+        "namespace": "iceberg_namespace",
+        "icebergCatalog": {
+            "type": "hive",
+            "warehousePath": "hdfs://warehouse/path",
+            "catalogUri": "thrift://hive-metastore.x.com:8970",
+            "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "catalog_test",
+                "hadoop.security.authentication": "kerberos",
+                "hadoop.security.authorization": "true"
+            }
+        },
+        "icebergFilter": {
+            "type": "interval",
+            "filterColumn": "event_time",
+            "intervals": [
+              "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+            ]
+        },
+        "warehouseSource": {
+            "type": "hdfs"
+        }
+      },
+      "inputFormat": {
+        "type": "parquet"
+      }
+  },
+      ...
+},
+...
+```
+
+```json
+...
+        "ioConfig": {
+          "type": "index_parallel",
+          "inputSource": {
+            "type": "iceberg",
+            "tableName": "iceberg_table",
+            "namespace": "iceberg_namespace",
+            "icebergCatalog": {
+              "type": "hive",
+              "warehousePath": "hdfs://warehouse/path",
+              "catalogUri": "thrift://hive-metastore.x.com:8970",
+              "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "default_catalog",
+                "fs.s3a.access.key" : "S3_ACCESS_KEY",
+                "fs.s3a.secret.key" : "S3_SECRET_KEY",
+                "fs.s3a.endpoint" : "S3_API_ENDPOINT"
+              }
+            },
+            "icebergFilter": {
+              "type": "interval",
+              "filterColumn": "event_time",
+              "intervals": [
+                "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+              ]
+            },
+            "warehouseSource": {
+              "type": "s3",
+              "endpointConfig": {
+                "url": "teststore.aws.com",
+                "signingRegion": "us-west-2a"
+              },
+              "clientConfig": {
+                "protocol": "http",
+                "disableChunkedEncoding": true,
+                "enablePathStyleAccess": true,
+                "forceGlobalBucketAccessEnabled": false
+              },
+              "properties": {
+                "accessKeyId": {
+                  "type": "default",
+                  "password": "foo"
+                },
+                "secretAccessKey": {
+                  "type": "default",
+                  "password": "bar"
+                }
+              },
+            }
+          },
+          "inputFormat": {
+            "type": "parquet"
+          }
+        },
+...
+},
+```
+
+|Property|Description|Required|

Review Comment:
   ```suggestion
   |**Property**|**Description**|**Required**|
   ```



##########
docs/ingestion/input-sources.md:
##########
@@ -794,6 +794,194 @@ The following is an example of a Combining input source spec:
 ...
 ```
 
+## Iceberg input source
+
+> You need to include the `druid-iceberg-extensions` as an extension to use the Iceberg input source.
+
+The Iceberg input source is used to read data stored in the Iceberg table format. For a given table, this input source scans up to the latest iceberg snapshot from the configured Hive catalog and the underlying live data files will be ingested using the existing input source formats available in Druid.
+
+The Iceberg input source cannot be independent as it relies on the existing input sources to perform the actual read from the Data files.
+For example, if the warehouse associated with an iceberg catalog is on `S3`, please ensure that the [`druid-s3-extensions`](../development/extensions-core/s3.md) extension is also loaded.
+
+Sample specs:
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "iceberg",
+        "tableName": "iceberg_table",
+        "namespace": "iceberg_namespace",
+        "icebergCatalog": {
+            "type": "hive",
+            "warehousePath": "hdfs://warehouse/path",
+            "catalogUri": "thrift://hive-metastore.x.com:8970",
+            "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "catalog_test",
+                "hadoop.security.authentication": "kerberos",
+                "hadoop.security.authorization": "true"
+            }
+        },
+        "icebergFilter": {
+            "type": "interval",
+            "filterColumn": "event_time",
+            "intervals": [
+              "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+            ]
+        },
+        "warehouseSource": {
+            "type": "hdfs"
+        }
+      },
+      "inputFormat": {
+        "type": "parquet"
+      }
+  },
+      ...
+},
+...
+```
+
+```json
+...
+        "ioConfig": {
+          "type": "index_parallel",
+          "inputSource": {
+            "type": "iceberg",
+            "tableName": "iceberg_table",
+            "namespace": "iceberg_namespace",
+            "icebergCatalog": {
+              "type": "hive",
+              "warehousePath": "hdfs://warehouse/path",
+              "catalogUri": "thrift://hive-metastore.x.com:8970",
+              "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "default_catalog",
+                "fs.s3a.access.key" : "S3_ACCESS_KEY",
+                "fs.s3a.secret.key" : "S3_SECRET_KEY",
+                "fs.s3a.endpoint" : "S3_API_ENDPOINT"
+              }
+            },
+            "icebergFilter": {
+              "type": "interval",
+              "filterColumn": "event_time",
+              "intervals": [
+                "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+              ]
+            },
+            "warehouseSource": {
+              "type": "s3",
+              "endpointConfig": {
+                "url": "teststore.aws.com",
+                "signingRegion": "us-west-2a"
+              },
+              "clientConfig": {
+                "protocol": "http",
+                "disableChunkedEncoding": true,
+                "enablePathStyleAccess": true,
+                "forceGlobalBucketAccessEnabled": false
+              },
+              "properties": {
+                "accessKeyId": {
+                  "type": "default",
+                  "password": "foo"
+                },
+                "secretAccessKey": {
+                  "type": "default",
+                  "password": "bar"
+                }
+              },
+            }
+          },
+          "inputFormat": {
+            "type": "parquet"
+          }
+        },
+...
+},
+```
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set the value to `iceberg`.|yes|
+|tableName|The iceberg table name configured in the catalog.|yes|
+|namespace|The iceberg namespace associated with the table|yes|
+|icebergFilter|JSON Object used to filter data files within a snapshot when reading|no|
+|icebergCatalog|JSON Object used to define the catalog that manages the configured iceberg table|yes|
+|warehouseSource|JSON Object used to indicate which native input source needs to be used to read the data files from the warehouse|yes|
+
+Catalog Object:
+
+There are two supported catalog types: `local` and `hive`
+
+Local catalog:
+
+|Property|Description|Required|

Review Comment:
   ```suggestion
   |**Property**|**Description**|**Required**|
   ```





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260496160


##########
docs/development/extensions-contrib/iceberg.md:
##########
@@ -0,0 +1,121 @@
+---
+id: iceberg 
+title: "Iceberg"
+---
+
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+
+## Iceberg Ingest Extension
+
+This extension provides [IcebergInputSource](../../ingestion/input-sources.md#iceberg-input-source) which enables ingestion of data stored in the Iceberg table format into Druid.
+
+Apache Iceberg is an open table format for huge analytic datasets. Even though iceberg manages most of its metadata in metadata files in the object storage, it is still dependent on a metastore for managing a certain amount of metadata.
+These metastores are defined as Iceberg catalogs and this extension supports connecting to the following catalog types:
+* Hive metastore catalog
+* Local catalog
+
+Support for AWS Glue and REST based catalogs are not available yet.
+
+For a given catalog, iceberg table name and filters, the IcebergInputSource works by reading the table from the catalog, applying the filters and extracting all the underlying live data files up to the latest snapshot.
+The data files are in either Parquet, ORC or Avro formats and all of these have InputFormat support in Druid. The data files typically reside in a warehouse location which could be in HDFS, S3 or the local filesystem.
+This extension relies on the existing InputSource connectors in Druid to read the data files from the warehouse. Therefore, the IcebergInputSource can be considered as an intermediate InputSource which provides the file paths for other InputSource implementations.
+
+### Load the Iceberg Ingest extension
+
+To use the iceberg extension, add the `druid-iceberg-extensions` to the list of loaded extensions. See [Loading extensions](../../configuration/extensions.md#loading-extensions) for more information.
+
+
+### Hive Metastore catalog
+
+For Druid to seamlessly talk to the Hive Metastore, ensure that the Hive specific configuration files such as `hive-site.xml` and `core-site.xml` are available in the Druid classpath for peon processes.  
+Hive specific properties can also be specified under the `catalogProperties` object in the ingestion spec. 
+
+Hive metastore catalogs can be associated with different types of warehouses, but this extension presently only supports HDFS and S3 warehouse directories.
+
+#### Reading from HDFS warehouse 
+
+Ensure that the extension `druid-hdfs-storage` is loaded. The data file paths are extracted from the Hive metastore catalog and [HDFS input source](../../ingestion/input-sources.md#hdfs-input-source) is used to ingest these files.
+The `warehouseSource` type in the ingestion spec should be `hdfs`.
+
+If the Hive metastore supports Kerberos authentication, the following properties will be required in the `catalogProperties`:
+
+```json
+"catalogProperties": {
+  "principal": "krb_principal",
+  "keytab": "/path/to/keytab"
+}
+```
+Only Kerberos based authentication is supported as of now.
+
+#### Reading from S3 warehouse
+
+Ensure that the extension `druid-s3-extensions` is loaded. The data file paths are extracted from the Hive metastore catalog and `S3InputSource` is used to ingest these files.
+The `warehouseSource` type in the ingestion spec should be `s3`. If the S3 endpoint for the warehouse is different from the endpoint configured as the deep storage, the following properties are required in the `warehouseSource` section to define the S3 endpoint settings:
+
+```json
+"warehouseSource": {
+  "type": "s3",
+  "endpointConfig": {
+    "url": "S3_ENDPOINT_URL",
+    "signingRegion": "us-east-1"
+  },
+  "clientConfig": {
+    "protocol": "http",
+    "disableChunkedEncoding": true,
+    "enablePathStyleAccess": true,
+    "forceGlobalBucketAccessEnabled": false
+  },
+  "properties": {
+    "accessKeyId": {
+      "type": "default",
+      "password": "<ACCESS_KEY_ID"
+    },
+    "secretAccessKey": {
+      "type": "default",
+      "password": "<SECRET_ACCESS_KEY>"
+    }
+  }
+}
+```
+
+This extension uses the [Hadoop AWS module](https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/) to connect to S3 and retrieve the metadata and data file paths.
+The following properties will be required in the `catalogProperties`:
+
+```json
+"catalogProperties": {
+  "fs.s3a.access.key" : "S3_ACCESS_KEY",
+  "fs.s3a.secret.key" : "S3_SECRET_KEY",
+  "fs.s3a.endpoint" : "S3_API_ENDPOINT"
+}
+```
+Since the Hadoop AWS connector uses the `s3a` filesystem based client, the warehouse path should be specified with the `s3a://` protocol instead of `s3://`.
+
+### Local Catalog
+
+The local catalog type can be used for catalogs configured on the local filesystem. The `icebergCatalog` type should be set as `local`. This catalog is useful for demos or localized tests and is not recommended for production use cases.
+This catalog only supports reading from a local filesystem and so the `warehouseSource` is defined as `local`.
+
+### Known limitations
+
+This extension does not presently fully utilize the iceberg features such as snapshotting or schema evolution. Following are the current limitations of this extension:

Review Comment:
   ```suggestion
   This section lists the known limitations that apply to the `druid-iceberg-extensions` extension.
   
   - This extension does not fully utilize the Iceberg features such as snapshotting or schema evolution. 
   ```
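
The Local Catalog section quoted above has no accompanying spec. A rough sketch of the relevant `inputSource` fields, assuming an illustrative filesystem path and that the `local` warehouse source needs no settings beyond its type (it may accept additional properties that this sketch omits):

```json
"inputSource": {
  "type": "iceberg",
  "tableName": "iceberg_table",
  "namespace": "iceberg_namespace",
  "icebergCatalog": {
    "type": "local",
    "warehousePath": "/tmp/iceberg_warehouse"
  },
  "warehouseSource": {
    "type": "local"
  }
}
```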





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260529401


##########
docs/ingestion/input-sources.md:
##########
@@ -794,6 +794,194 @@ The following is an example of a Combining input source spec:
 ...
 ```
 
+## Iceberg input source
+
+> You need to include the `druid-iceberg-extensions` as an extension to use the Iceberg input source.
+
+The Iceberg input source is used to read data stored in the Iceberg table format. For a given table, this input source scans up to the latest iceberg snapshot from the configured Hive catalog and the underlying live data files will be ingested using the existing input source formats available in Druid.
+
+The Iceberg input source cannot be independent as it relies on the existing input sources to perform the actual read from the Data files.
+For example, if the warehouse associated with an iceberg catalog is on `S3`, please ensure that the [`druid-s3-extensions`](../development/extensions-core/s3.md) extension is also loaded.
+
+Sample specs:
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "iceberg",
+        "tableName": "iceberg_table",
+        "namespace": "iceberg_namespace",
+        "icebergCatalog": {
+            "type": "hive",
+            "warehousePath": "hdfs://warehouse/path",
+            "catalogUri": "thrift://hive-metastore.x.com:8970",
+            "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "catalog_test",
+                "hadoop.security.authentication": "kerberos",
+                "hadoop.security.authorization": "true"
+            }
+        },
+        "icebergFilter": {
+            "type": "interval",
+            "filterColumn": "event_time",
+            "intervals": [
+              "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+            ]
+        },
+        "warehouseSource": {
+            "type": "hdfs"
+        }
+      },
+      "inputFormat": {
+        "type": "parquet"
+      }
+  },
+      ...
+},
+...
+```
+
+```json
+...
+        "ioConfig": {
+          "type": "index_parallel",
+          "inputSource": {
+            "type": "iceberg",
+            "tableName": "iceberg_table",
+            "namespace": "iceberg_namespace",
+            "icebergCatalog": {
+              "type": "hive",
+              "warehousePath": "hdfs://warehouse/path",
+              "catalogUri": "thrift://hive-metastore.x.com:8970",
+              "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "default_catalog",
+                "fs.s3a.access.key" : "S3_ACCESS_KEY",
+                "fs.s3a.secret.key" : "S3_SECRET_KEY",
+                "fs.s3a.endpoint" : "S3_API_ENDPOINT"
+              }
+            },
+            "icebergFilter": {
+              "type": "interval",
+              "filterColumn": "event_time",
+              "intervals": [
+                "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+              ]
+            },
+            "warehouseSource": {
+              "type": "s3",
+              "endpointConfig": {
+                "url": "teststore.aws.com",
+                "signingRegion": "us-west-2a"
+              },
+              "clientConfig": {
+                "protocol": "http",
+                "disableChunkedEncoding": true,
+                "enablePathStyleAccess": true,
+                "forceGlobalBucketAccessEnabled": false
+              },
+              "properties": {
+                "accessKeyId": {
+                  "type": "default",
+                  "password": "foo"
+                },
+                "secretAccessKey": {
+                  "type": "default",
+                  "password": "bar"
+                }
+              },
+            }
+          },
+          "inputFormat": {
+            "type": "parquet"
+          }
+        },
+...
+},
+```
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set the value to `iceberg`.|yes|
+|tableName|The iceberg table name configured in the catalog.|yes|
+|namespace|The iceberg namespace associated with the table|yes|
+|icebergFilter|JSON Object used to filter data files within a snapshot when reading|no|
+|icebergCatalog|JSON Object used to define the catalog that manages the configured iceberg table|yes|
+|warehouseSource|JSON Object used to indicate which native input source needs to be used to read the data files from the warehouse|yes|
+
+Catalog Object:
+
+There are two supported catalog types: `local` and `hive`
+
+Local catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `local`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+Hive Catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `hive`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogUri|The URI associated with the hive catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+IcebergFilter Object:
+
+This input source provides filters: `and` , `equals` , `interval` and `or`. These filters can be used to filter out data files from a snapshot, thereby reducing the number of files Druid has to ingest.
+
+`equals` Filter:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `equals`.|yes|

Review Comment:
   ```suggestion
   |`type`|Set this value to `equals`.|Yes|
   ```





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260528678


##########
docs/ingestion/input-sources.md:
##########
@@ -794,6 +794,194 @@ The following is an example of a Combining input source spec:
 ...
 ```
 
+## Iceberg input source
+
+> You need to include the `druid-iceberg-extensions` as an extension to use the Iceberg input source.
+
+The Iceberg input source is used to read data stored in the Iceberg table format. For a given table, this input source scans up to the latest iceberg snapshot from the configured Hive catalog and the underlying live data files will be ingested using the existing input source formats available in Druid.
+
+The Iceberg input source cannot be independent as it relies on the existing input sources to perform the actual read from the Data files.
+For example, if the warehouse associated with an iceberg catalog is on `S3`, please ensure that the [`druid-s3-extensions`](../development/extensions-core/s3.md) extension is also loaded.
+
+Sample specs:
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "iceberg",
+        "tableName": "iceberg_table",
+        "namespace": "iceberg_namespace",
+        "icebergCatalog": {
+            "type": "hive",
+            "warehousePath": "hdfs://warehouse/path",
+            "catalogUri": "thrift://hive-metastore.x.com:8970",
+            "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "catalog_test",
+                "hadoop.security.authentication": "kerberos",
+                "hadoop.security.authorization": "true"
+            }
+        },
+        "icebergFilter": {
+            "type": "interval",
+            "filterColumn": "event_time",
+            "intervals": [
+              "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+            ]
+        },
+        "warehouseSource": {
+            "type": "hdfs"
+        }
+      },
+      "inputFormat": {
+        "type": "parquet"
+      }
+  },
+      ...
+},
+...
+```
+
+```json
+...
+        "ioConfig": {
+          "type": "index_parallel",
+          "inputSource": {
+            "type": "iceberg",
+            "tableName": "iceberg_table",
+            "namespace": "iceberg_namespace",
+            "icebergCatalog": {
+              "type": "hive",
+              "warehousePath": "hdfs://warehouse/path",
+              "catalogUri": "thrift://hive-metastore.x.com:8970",
+              "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "default_catalog",
+                "fs.s3a.access.key" : "S3_ACCESS_KEY",
+                "fs.s3a.secret.key" : "S3_SECRET_KEY",
+                "fs.s3a.endpoint" : "S3_API_ENDPOINT"
+              }
+            },
+            "icebergFilter": {
+              "type": "interval",
+              "filterColumn": "event_time",
+              "intervals": [
+                "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+              ]
+            },
+            "warehouseSource": {
+              "type": "s3",
+              "endpointConfig": {
+                "url": "teststore.aws.com",
+                "signingRegion": "us-west-2a"
+              },
+              "clientConfig": {
+                "protocol": "http",
+                "disableChunkedEncoding": true,
+                "enablePathStyleAccess": true,
+                "forceGlobalBucketAccessEnabled": false
+              },
+              "properties": {
+                "accessKeyId": {
+                  "type": "default",
+                  "password": "foo"
+                },
+                "secretAccessKey": {
+                  "type": "default",
+                  "password": "bar"
+                }
+              },
+            }
+          },
+          "inputFormat": {
+            "type": "parquet"
+          }
+        },
+...
+},
+```
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set the value to `iceberg`.|yes|
+|tableName|The iceberg table name configured in the catalog.|yes|
+|namespace|The iceberg namespace associated with the table|yes|
+|icebergFilter|JSON Object used to filter data files within a snapshot when reading|no|
+|icebergCatalog|JSON Object used to define the catalog that manages the configured iceberg table|yes|
+|warehouseSource|JSON Object used to indicate which native input source needs to be used to read the data files from the warehouse|yes|
+
+Catalog Object:
+
+There are two supported catalog types: `local` and `hive`
+
+Local catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `local`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+Hive Catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `hive`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogUri|The URI associated with the hive catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+IcebergFilter Object:
+
+This input source provides filters: `and` , `equals` , `interval` and `or`. These filters can be used to filter out data files from a snapshot, thereby reducing the number of files Druid has to ingest.

Review Comment:
   ```suggestion
   This input source provides the following filters: `and`, `equals`, `interval`, and `or`. You can use these filters to filter out data files from a snapshot, reducing the number of files Druid has to ingest.
   ```
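
   For context on what these filters push down at scan time: judging from `IcebergIntervalFilter` quoted later in this thread, an `interval` filter becomes a start-inclusive, end-exclusive bound pair on the filter column. A minimal Java sketch of the equivalent Iceberg expression, reusing the column name and timestamps from the sample spec above (the class and method names are illustrative only):

   ```java
   import org.apache.iceberg.expressions.Expression;
   import org.apache.iceberg.expressions.Expressions;
   import org.apache.iceberg.expressions.Literal;
   import org.apache.iceberg.types.Types;

   class IntervalFilterSketch
   {
     // Roughly what the sample "interval" icebergFilter above turns into:
     // a start-inclusive, end-exclusive bound pair on the filter column,
     // with the ISO timestamps converted to Iceberg timestamp literals.
     static Expression eventTimeBetween()
     {
       long start = (long) Literal.of("2023-05-10T19:00:00.000Z")
                                  .to(Types.TimestampType.withZone())
                                  .value();
       long end = (long) Literal.of("2023-05-10T20:00:00.000Z")
                                .to(Types.TimestampType.withZone())
                                .value();
       return Expressions.and(
           Expressions.greaterThanOrEqual("event_time", start),
           Expressions.lessThan("event_time", end)
       );
     }
   }
   ```

   Because the bounds are applied to the Iceberg table scan itself, data files that fall entirely outside the requested window can be pruned before Druid ever reads them.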





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260523632


##########
docs/ingestion/input-sources.md:
##########
@@ -794,6 +794,194 @@ The following is an example of a Combining input source spec:
 ...
 ```
 
+## Iceberg input source
+
+> You need to include the `druid-iceberg-extensions` as an extension to use the Iceberg input source.
+
+The Iceberg input source is used to read data stored in the Iceberg table format. For a given table, this input source scans up to the latest iceberg snapshot from the configured Hive catalog and the underlying live data files will be ingested using the existing input source formats available in Druid.
+
+The Iceberg input source cannot be independent as it relies on the existing input sources to perform the actual read from the Data files.
+For example, if the warehouse associated with an iceberg catalog is on `S3`, please ensure that the [`druid-s3-extensions`](../development/extensions-core/s3.md) extension is also loaded.
+
+Sample specs:
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "iceberg",
+        "tableName": "iceberg_table",
+        "namespace": "iceberg_namespace",
+        "icebergCatalog": {
+            "type": "hive",
+            "warehousePath": "hdfs://warehouse/path",
+            "catalogUri": "thrift://hive-metastore.x.com:8970",
+            "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "catalog_test",
+                "hadoop.security.authentication": "kerberos",
+                "hadoop.security.authorization": "true"
+            }
+        },
+        "icebergFilter": {
+            "type": "interval",
+            "filterColumn": "event_time",
+            "intervals": [
+              "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+            ]
+        },
+        "warehouseSource": {
+            "type": "hdfs"
+        }
+      },
+      "inputFormat": {
+        "type": "parquet"
+      }
+  },
+      ...
+},
+...
+```
+
+```json
+...
+        "ioConfig": {
+          "type": "index_parallel",
+          "inputSource": {
+            "type": "iceberg",
+            "tableName": "iceberg_table",
+            "namespace": "iceberg_namespace",
+            "icebergCatalog": {
+              "type": "hive",
+              "warehousePath": "hdfs://warehouse/path",
+              "catalogUri": "thrift://hive-metastore.x.com:8970",
+              "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "default_catalog",
+                "fs.s3a.access.key" : "S3_ACCESS_KEY",
+                "fs.s3a.secret.key" : "S3_SECRET_KEY",
+                "fs.s3a.endpoint" : "S3_API_ENDPOINT"
+              }
+            },
+            "icebergFilter": {
+              "type": "interval",
+              "filterColumn": "event_time",
+              "intervals": [
+                "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+              ]
+            },
+            "warehouseSource": {
+              "type": "s3",
+              "endpointConfig": {
+                "url": "teststore.aws.com",
+                "signingRegion": "us-west-2a"
+              },
+              "clientConfig": {
+                "protocol": "http",
+                "disableChunkedEncoding": true,
+                "enablePathStyleAccess": true,
+                "forceGlobalBucketAccessEnabled": false
+              },
+              "properties": {
+                "accessKeyId": {
+                  "type": "default",
+                  "password": "foo"
+                },
+                "secretAccessKey": {
+                  "type": "default",
+                  "password": "bar"
+                }
+              },
+            }
+          },
+          "inputFormat": {
+            "type": "parquet"
+          }
+        },
+...
+},
+```
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set the value to `iceberg`.|yes|
+|tableName|The iceberg table name configured in the catalog.|yes|
+|namespace|The iceberg namespace associated with the table|yes|
+|icebergFilter|JSON Object used to filter data files within a snapshot when reading|no|
+|icebergCatalog|JSON Object used to define the catalog that manages the configured iceberg table|yes|
+|warehouseSource|JSON Object used to indicate which native input source needs to be used to read the data files from the warehouse|yes|
+
+Catalog Object:
+
+There are two supported catalog types: `local` and `hive`
+
+Local catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `local`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|
+
+Hive Catalog:

Review Comment:
   ```suggestion
   The following table lists the properties of a `hive` catalog:
   ```





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1260526232


##########
docs/ingestion/input-sources.md:
##########
@@ -794,6 +794,194 @@ The following is an example of a Combining input source spec:
 ...
 ```
 
+## Iceberg input source
+
+> You need to include the `druid-iceberg-extensions` as an extension to use the Iceberg input source.
+
+The Iceberg input source is used to read data stored in the Iceberg table format. For a given table, this input source scans up to the latest iceberg snapshot from the configured Hive catalog and the underlying live data files will be ingested using the existing input source formats available in Druid.
+
+The Iceberg input source cannot be independent as it relies on the existing input sources to perform the actual read from the Data files.
+For example, if the warehouse associated with an iceberg catalog is on `S3`, please ensure that the [`druid-s3-extensions`](../development/extensions-core/s3.md) extension is also loaded.
+
+Sample specs:
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "iceberg",
+        "tableName": "iceberg_table",
+        "namespace": "iceberg_namespace",
+        "icebergCatalog": {
+            "type": "hive",
+            "warehousePath": "hdfs://warehouse/path",
+            "catalogUri": "thrift://hive-metastore.x.com:8970",
+            "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "catalog_test",
+                "hadoop.security.authentication": "kerberos",
+                "hadoop.security.authorization": "true"
+            }
+        },
+        "icebergFilter": {
+            "type": "interval",
+            "filterColumn": "event_time",
+            "intervals": [
+              "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+            ]
+        },
+        "warehouseSource": {
+            "type": "hdfs"
+        }
+      },
+      "inputFormat": {
+        "type": "parquet"
+      }
+  },
+      ...
+},
+...
+```
+
+```json
+...
+        "ioConfig": {
+          "type": "index_parallel",
+          "inputSource": {
+            "type": "iceberg",
+            "tableName": "iceberg_table",
+            "namespace": "iceberg_namespace",
+            "icebergCatalog": {
+              "type": "hive",
+              "warehousePath": "hdfs://warehouse/path",
+              "catalogUri": "thrift://hive-metastore.x.com:8970",
+              "catalogProperties": {
+                "hive.metastore.connect.retries": "1",
+                "hive.metastore.execute.setugi": "false",
+                "hive.metastore.kerberos.principal": "KRB_PRINCIPAL",
+                "hive.metastore.sasl.enabled": "true",
+                "metastore.catalog.default": "default_catalog",
+                "fs.s3a.access.key" : "S3_ACCESS_KEY",
+                "fs.s3a.secret.key" : "S3_SECRET_KEY",
+                "fs.s3a.endpoint" : "S3_API_ENDPOINT"
+              }
+            },
+            "icebergFilter": {
+              "type": "interval",
+              "filterColumn": "event_time",
+              "intervals": [
+                "2023-05-10T19:00:00.000Z/2023-05-10T20:00:00.000Z"
+              ]
+            },
+            "warehouseSource": {
+              "type": "s3",
+              "endpointConfig": {
+                "url": "teststore.aws.com",
+                "signingRegion": "us-west-2a"
+              },
+              "clientConfig": {
+                "protocol": "http",
+                "disableChunkedEncoding": true,
+                "enablePathStyleAccess": true,
+                "forceGlobalBucketAccessEnabled": false
+              },
+              "properties": {
+                "accessKeyId": {
+                  "type": "default",
+                  "password": "foo"
+                },
+                "secretAccessKey": {
+                  "type": "default",
+                  "password": "bar"
+                }
+              },
+            }
+          },
+          "inputFormat": {
+            "type": "parquet"
+          }
+        },
+...
+},
+```
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set the value to `iceberg`.|yes|
+|tableName|The iceberg table name configured in the catalog.|yes|
+|namespace|The iceberg namespace associated with the table|yes|
+|icebergFilter|JSON Object used to filter data files within a snapshot when reading|no|
+|icebergCatalog|JSON Object used to define the catalog that manages the configured iceberg table|yes|
+|warehouseSource|JSON Object used to indicate which native input source needs to be used to read the data files from the warehouse|yes|
+
+Catalog Object:
+
+There are two supported catalog types: `local` and `hive`
+
+Local catalog:
+
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set this value to `local`.|yes|
+|warehousePath|The location of the warehouse associated with the catalog|yes|
+|catalogProperties|Map of any additional properties that needs to be attached to the catalog|no|

Review Comment:
   ```suggestion
   |`catalogProperties`|Map of any additional properties that needs to be attached to the catalog.|No|
   ```





[GitHub] [druid] github-code-scanning[bot] commented on a diff in pull request #14329: Extension to read and ingest iceberg data files

Posted by "github-code-scanning[bot] (via GitHub)" <gi...@apache.org>.
github-code-scanning[bot] commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1203154116


##########
extensions-contrib/druid-iceberg-extensions/src/test/java/org/apache/druid/iceberg/filter/IcebergAndFilterTest.java:
##########
@@ -0,0 +1,100 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.iceberg.filter;
+
+import org.apache.druid.java.util.common.Intervals;
+import org.apache.iceberg.expressions.Expression;
+import org.apache.iceberg.expressions.Expressions;
+import org.apache.iceberg.expressions.Literal;
+import org.apache.iceberg.types.Types;
+import org.junit.Assert;
+import org.junit.Rule;
+import org.junit.Test;
+import org.junit.rules.ExpectedException;
+
+import java.util.Arrays;
+import java.util.Collections;
+
+public class IcebergAndFilterTest
+{
+  @Rule
+  public ExpectedException expectedException = ExpectedException.none();

Review Comment:
   ## Deprecated method or constructor invocation
   
   Invoking [ExpectedException.none](1) should be avoided because it has been deprecated.
   
   [Show more details](https://github.com/apache/druid/security/code-scanning/4979)



##########
extensions-contrib/druid-iceberg-extensions/src/test/java/org/apache/druid/iceberg/filter/IcebergOrFilterTest.java:
##########
@@ -0,0 +1,122 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.iceberg.filter;
+
+import org.apache.druid.java.util.common.Intervals;
+import org.apache.iceberg.expressions.Expression;
+import org.apache.iceberg.expressions.Expressions;
+import org.apache.iceberg.expressions.Literal;
+import org.apache.iceberg.types.Types;
+import org.junit.Assert;
+import org.junit.Rule;
+import org.junit.Test;
+import org.junit.rules.ExpectedException;
+
+import java.util.Arrays;
+import java.util.Collections;
+
+public class IcebergOrFilterTest
+{
+  @Rule
+  public ExpectedException expectedException = ExpectedException.none();

Review Comment:
   ## Deprecated method or constructor invocation
   
   Invoking [ExpectedException.none](1) should be avoided because it has been deprecated.
   
   [Show more details](https://github.com/apache/druid/security/code-scanning/4980)



##########
processing/src/test/java/org/apache/druid/data/input/impl/LocalInputSourceAdapterTest.java:
##########
@@ -0,0 +1,111 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.data.input.impl;
+
+import org.apache.druid.data.input.InputFormat;
+import org.apache.druid.data.input.InputRowSchema;
+import org.apache.druid.data.input.InputSplit;
+import org.apache.druid.data.input.SplitHintSpec;
+import org.apache.druid.java.util.common.ISE;
+import org.easymock.EasyMock;
+import org.junit.Assert;
+import org.junit.Rule;
+import org.junit.Test;
+import org.junit.rules.ExpectedException;
+import org.junit.rules.TemporaryFolder;
+
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.List;
+import java.util.stream.Collectors;
+
+public class LocalInputSourceAdapterTest
+{
+  @Rule
+  public ExpectedException expectedException = ExpectedException.none();

Review Comment:
   ## Deprecated method or constructor invocation
   
   Invoking [ExpectedException.none](1) should be avoided because it has been deprecated.
   
   [Show more details](https://github.com/apache/druid/security/code-scanning/4981)
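
   All three findings above flag the same pattern. A minimal sketch of the usual replacement, `Assert.assertThrows` (available since JUnit 4.13), written here against the `IcebergIntervalFilter` constructor quoted later in this thread; the test class and method names are illustrative only:

   ```java
   package org.apache.druid.iceberg.filter;

   import org.apache.druid.java.util.common.Intervals;
   import org.junit.Assert;
   import org.junit.Test;

   import java.util.Collections;

   public class IcebergIntervalFilterNullCheckTest
   {
     @Test
     public void testNullFilterColumnIsRejected()
     {
       // Capture the exception directly instead of declaring it up front
       // with the deprecated ExpectedException @Rule.
       NullPointerException e = Assert.assertThrows(
           NullPointerException.class,
           () -> new IcebergIntervalFilter(
               null,
               Collections.singletonList(Intervals.of("2023-05-10/2023-05-11"))
           )
       );
       Assert.assertEquals("filterColumn can not be null", e.getMessage());
     }
   }
   ```

   `assertThrows` returns the thrown exception, so any message checks can be written as plain assertions and the `@Rule` field dropped.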





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "a2l007 (via GitHub)" <gi...@apache.org>.
a2l007 commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1255048172


##########
extensions-contrib/druid-iceberg-extensions/src/main/java/org/apache/druid/iceberg/filter/IcebergIntervalFilter.java:
##########
@@ -0,0 +1,90 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.iceberg.filter;
+
+import com.fasterxml.jackson.annotation.JsonCreator;
+import com.fasterxml.jackson.annotation.JsonProperty;
+import com.google.common.base.Preconditions;
+import org.apache.iceberg.TableScan;
+import org.apache.iceberg.expressions.Expression;
+import org.apache.iceberg.expressions.Expressions;
+import org.apache.iceberg.expressions.Literal;
+import org.apache.iceberg.types.Types;
+import org.joda.time.Interval;
+
+import java.util.ArrayList;
+import java.util.List;
+
+public class IcebergIntervalFilter implements IcebergFilter
+{
+  @JsonProperty
+  private final String filterColumn;
+
+  @JsonProperty
+  private final List<Interval> intervals;
+
+  @JsonCreator
+  public IcebergIntervalFilter(
+      @JsonProperty("filterColumn") String filterColumn,
+      @JsonProperty("intervals") List<Interval> intervals
+  )
+  {
+    Preconditions.checkNotNull(filterColumn, "filterColumn can not be null");
+    Preconditions.checkNotNull(intervals, "intervals can not be null");
+    this.filterColumn = filterColumn;
+    this.intervals = intervals;
+  }
+
+  @Override
+  public TableScan filter(TableScan tableScan)
+  {
+    return tableScan.filter(getFilterExpression());
+  }
+
+  @Override
+  public Expression getFilterExpression()
+  {
+    List<Expression> expressions = new ArrayList<>();
+    for (Interval filterInterval : intervals) {
+      Long dateStart = (long) Literal.of(filterInterval.getStart().toString())
+                                     .to(Types.TimestampType.withZone())
+                                     .value();
+      Long dateEnd = (long) Literal.of(filterInterval.getEnd().toString())
+                                   .to(Types.TimestampType.withZone())
+                                   .value();
+
+      expressions.add(Expressions.and(
+          Expressions.greaterThanOrEqual(
+              filterColumn,
+              dateStart
+          ),
+          Expressions.lessThan(
+              filterColumn,
+              dateEnd
+          )
+      ));
+    }
+    Expression finalExpr = Expressions.alwaysFalse();

Review Comment:
   This is just setting a default for the final expression. Do you see any issues here?
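
   The quoted hunk stops at that line, so what follows is an assumption about the rest of the method, but the usual shape is to OR each per-interval expression into that default; a small self-contained sketch of why `alwaysFalse()` is a safe starting value for such a fold:

   ```java
   import org.apache.iceberg.expressions.Expression;
   import org.apache.iceberg.expressions.Expressions;

   import java.util.List;

   class OrFoldSketch
   {
     // alwaysFalse() is the identity element for OR: or(alwaysFalse(), x)
     // behaves like x, so the fold starts from "match nothing" and each
     // interval expression widens the result.
     static Expression orAll(List<Expression> perIntervalExpressions)
     {
       Expression result = Expressions.alwaysFalse();
       for (Expression expr : perIntervalExpressions) {
         result = Expressions.or(result, expr);
       }
       return result;
     }
   }
   ```

   With that identity, an empty expression list degenerates to a match-nothing scan rather than a full-table scan, which looks like the intended behavior of the default.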



##########
extensions-contrib/druid-iceberg-extensions/pom.xml:
##########
@@ -0,0 +1,313 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+<project xmlns="http://maven.apache.org/POM/4.0.0"
+         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
+         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
+
+  <groupId>org.apache.druid.extensions</groupId>
+  <artifactId>druid-iceberg-extensions</artifactId>
+  <name>druid-iceberg-extensions</name>
+  <description>druid-iceberg-extensions</description>
+
+  <parent>
+    <artifactId>druid</artifactId>
+    <groupId>org.apache.druid</groupId>
+    <version>27.0.0-SNAPSHOT</version>
+    <relativePath>../../pom.xml</relativePath>
+  </parent>
+  <modelVersion>4.0.0</modelVersion>
+
+  <properties>
+  </properties>
+  <dependencies>
+    <dependency>
+      <groupId>org.apache.hadoop</groupId>
+      <artifactId>hadoop-common</artifactId>
+      <version>${hadoop.compile.version}</version>
+      <exclusions>
+        <exclusion>
+          <groupId>io.netty</groupId>
+          <artifactId>netty-buffer</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>commons-cli</groupId>
+          <artifactId>commons-cli</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>log4j</groupId>
+          <artifactId>log4j</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>commons-codec</groupId>
+          <artifactId>commons-codec</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>commons-logging</groupId>
+          <artifactId>commons-logging</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>commons-io</groupId>
+          <artifactId>commons-io</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>commons-lang</groupId>
+          <artifactId>commons-lang</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>org.apache.httpcomponents</groupId>
+          <artifactId>httpclient</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>org.apache.httpcomponents</groupId>
+          <artifactId>httpcore</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>org.apache.zookeeper</groupId>
+          <artifactId>zookeeper</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>org.slf4j</groupId>
+          <artifactId>slf4j-api</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>org.slf4j</groupId>
+          <artifactId>slf4j-log4j12</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>javax.ws.rs</groupId>
+          <artifactId>jsr311-api</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>com.google.code.findbugs</groupId>
+          <artifactId>jsr305</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>org.mortbay.jetty</groupId>
+          <artifactId>jetty-util</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>com.google.protobuf</groupId>
+          <artifactId>protobuf-java</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>com.sun.jersey</groupId>
+          <artifactId>jersey-core</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>org.apache.curator</groupId>
+          <artifactId>curator-client</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>org.apache.commons</groupId>
+          <artifactId>commons-math3</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>com.google.guava</groupId>
+          <artifactId>guava</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>org.apache.avro</groupId>
+          <artifactId>avro</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>net.java.dev.jets3t</groupId>
+          <artifactId>jets3t</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>com.sun.jersey</groupId>
+          <artifactId>jersey-json</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>com.jcraft</groupId>
+          <artifactId>jsch</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>org.mortbay.jetty</groupId>
+          <artifactId>jetty</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>com.sun.jersey</groupId>
+          <artifactId>jersey-server</artifactId>
+        </exclusion>
+        <!-- Following are excluded to remove security vulnerabilities: -->
+        <exclusion>
+          <groupId>commons-beanutils</groupId>
+          <artifactId>commons-beanutils-core</artifactId>
+        </exclusion>
+      </exclusions>
+    </dependency>
+
+    <dependency>
+      <groupId>org.apache.iceberg</groupId>
+      <artifactId>iceberg-spark-runtime-3.3_2.12</artifactId>
+      <version>1.0.0</version>
+      <exclusions>
+        <exclusion>
+          <groupId>com.fasterxml.jackson.core</groupId>
+          <artifactId>jackson-databind</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>com.fasterxml.jackson.core</groupId>
+          <artifactId>jackson-core</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>com.fasterxml.jackson.core</groupId>
+          <artifactId>jackson-annotations</artifactId>
+        </exclusion>
+      </exclusions>
+    </dependency>
+
+    <dependency>
+      <groupId>org.apache.hive</groupId>
+      <artifactId>hive-metastore</artifactId>
+      <version>3.1.3</version>
+      <exclusions>
+        <exclusion>
+          <groupId>com.fasterxml.jackson.core</groupId>
+          <artifactId>jackson-databind</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>com.fasterxml.jackson.core</groupId>
+          <artifactId>jackson-core</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>com.fasterxml.jackson.core</groupId>
+          <artifactId>jackson-annotations</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>org.apache.hadoop</groupId>
+          <artifactId>hadoop-hdfs</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>org.apache.hbase</groupId>
+          <artifactId>hbase-client</artifactId>
+        </exclusion>
+      </exclusions>
+    </dependency>
+
+    <dependency>
+      <groupId>org.apache.druid</groupId>
+      <artifactId>druid-processing</artifactId>
+      <version>${project.parent.version}</version>
+      <scope>provided</scope>
+      <exclusions>
+        <exclusion>
+          <groupId>org.slf4j</groupId>
+          <artifactId>slf4j-api</artifactId>
+        </exclusion>
+      </exclusions>
+    </dependency>
+    <dependency>
+      <groupId>com.fasterxml.jackson.core</groupId>
+      <artifactId>jackson-annotations</artifactId>
+      <scope>provided</scope>
+    </dependency>
+    <dependency>
+      <groupId>com.fasterxml.jackson.core</groupId>
+      <artifactId>jackson-databind</artifactId>
+      <scope>provided</scope>
+    </dependency>
+
+    <dependency>
+      <groupId>org.apache.hadoop</groupId>
+      <artifactId>hadoop-hdfs-client</artifactId>
+      <scope>runtime</scope>
+    </dependency>
+
+    <dependency>
+      <groupId>org.apache.hadoop</groupId>
+      <artifactId>hadoop-mapreduce-client-core</artifactId>

Review Comment:
   We only need this dependency for the Hive catalog to work. It might work with the uber-shaded jars, but the dependencies they pull in are not needed for this extension.





Re: [PR] Extension to read and ingest iceberg data files (druid)

Posted by "abhishekagarwal87 (via GitHub)" <gi...@apache.org>.
abhishekagarwal87 commented on code in PR #14329:
URL: https://github.com/apache/druid/pull/14329#discussion_r1255248904


##########
docs/ingestion/input-sources.md:
##########
@@ -794,6 +794,194 @@ The following is an example of a Combining input source spec:
 ...
 ```
 
+## Iceberg input source
+
+> You need to include the `druid-iceberg-extensions` as an extension to use the Iceberg input source.
+
+The Iceberg input source is used to read data stored in the Iceberg table format. For a given table, this input source scans up to the latest iceberg snapshot from the configured Hive catalog and the underlying live data files will be ingested using the existing input source formats available in Druid.
+
+The Iceberg input source cannot be independent as it relies on the existing input sources to perform the actual read from the Data files.
+For example, if the warehouse associated with an iceberg catalog is on `S3`, please ensure that the [`druid-s3-extensions`](../development/extensions-core/s3.md) extension is also loaded.

Review Comment:
   Got it. 


