Posted to issues@nifi.apache.org by GitBox <gi...@apache.org> on 2020/04/05 17:09:15 UTC

[GitHub] [nifi] andrewglowacki opened a new pull request #4184: NIFI-7320 Adding HDFS Content Repository implementation.

andrewglowacki opened a new pull request #4184: NIFI-7320 Adding HDFS Content Repository implementation.
URL: https://github.com/apache/nifi/pull/4184
 
 
   Thank you for submitting a contribution to Apache NiFi.
   
   Please provide a short description of the PR here:
   
   This PR provides an HDFS based content repository that uses the Hadoop FileSystem API to store FlowFile content.
   
   _Enables HDFS Content Repository implementation for NIFI-7320._
   
   In order to streamline the review of the contribution we ask you
   to ensure the following steps have been taken:
   
   ### For all changes:
   - [x] Is there a JIRA ticket associated with this PR? Is it referenced 
        in the commit message?
   
   - [x] Does your PR title start with **NIFI-XXXX** where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.
   
   - [x] Has your PR been rebased against the latest commit within the target branch (typically `master`)?
   
   - [x] Is your initial contribution a single, squashed commit? _Additional commits in response to PR reviewer feedback should be made on this branch and pushed to allow change tracking. Do not `squash` or use `--force` when pushing to allow for clean monitoring of changes._
   
   ### For code changes:
   - [x] Have you ensured that the full suite of tests is executed via `mvn -Pcontrib-check clean install` at the root `nifi` folder?
   - [X] Have you written or updated unit tests to verify your changes?
   - [x] Have you verified that the full build is successful on both JDK 8 and JDK 11?
   - [x] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)? 
   - [x] If applicable, have you updated the `LICENSE` file, including the main `LICENSE` file under `nifi-assembly`?
   - [x] If applicable, have you updated the `NOTICE` file, including the main `NOTICE` file found under `nifi-assembly`?
   - [x] If adding new Properties, have you added `.displayName` in addition to .name (programmatic access) for each of the new properties?
   
   ### For documentation related changes:
   - [x] Have you ensured that format looks appropriate for the output in which it is rendered?
   
   ### Note:
   Please ensure that once the PR is submitted, you check travis-ci for build issues and submit an update to your PR as soon as possible.
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [nifi] joewitt commented on a change in pull request #4184: NIFI-7320 Adding HDFS Content Repository implementation.

Posted by GitBox <gi...@apache.org>.
joewitt commented on a change in pull request #4184: NIFI-7320 Adding HDFS Content Repository implementation.
URL: https://github.com/apache/nifi/pull/4184#discussion_r403771851
 
 

 ##########
 File path: nifi-assembly/pom.xml
 ##########
 @@ -192,6 +192,12 @@ language governing permissions and limitations under the License. -->
             <artifactId>nifi-hadoop-nar</artifactId>
             <version>1.12.0-SNAPSHOT</version>
             <type>nar</type>
+    	</dependency>
+	<dependency>
+            <groupId>org.apache.nifi</groupId>
+            <artifactId>nifi-hdfs-repository-nar</artifactId>
 
 Review comment:
   Since this NAR depends on the Hadoop NAR, it might be small enough to include, but we'd have to be clear that it is experimental.


[GitHub] [nifi] joewitt commented on a change in pull request #4184: NIFI-7320 Adding HDFS Content Repository implementation.

Posted by GitBox <gi...@apache.org>.
joewitt commented on a change in pull request #4184: NIFI-7320 Adding HDFS Content Repository implementation.
URL: https://github.com/apache/nifi/pull/4184#discussion_r403771498
 
 

 ##########
 File path: nifi-nar-bundles/nifi-hdfs-repository-bundle/nifi-hdfs-content-repository/pom.xml
 ##########
 @@ -0,0 +1,104 @@
+<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 https://maven.apache.org/xsd/maven-4.0.0.xsd">
+    <!--
+      Licensed to the Apache Software Foundation (ASF) under one or more
+      contributor license agreements.  See the NOTICE file distributed with
+      this work for additional information regarding copyright ownership.
+      The ASF licenses this file to You under the Apache License, Version 2.0
+      (the "License"); you may not use this file except in compliance with
+      the License.  You may obtain a copy of the License at
+          http://www.apache.org/licenses/LICENSE-2.0
+      Unless required by applicable law or agreed to in writing, software
+      distributed under the License is distributed on an "AS IS" BASIS,
+      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+      See the License for the specific language governing permissions and
+      limitations under the License.
+    -->
+    <modelVersion>4.0.0</modelVersion>
+    <parent>
+        <groupId>org.apache.nifi</groupId>
+        <artifactId>nifi-hdfs-repository-bundle</artifactId>
+        <version>1.12.0-SNAPSHOT</version>
+    </parent>
+    <artifactId>nifi-hdfs-content-repository</artifactId>
+    <packaging>jar</packaging>
+
+    <dependencies>
+        <dependency>
+            <groupId>org.apache.nifi</groupId>
+            <artifactId>nifi-framework-api</artifactId>
+            <version>${project.version}</version>
 
 Review comment:
   Please use explicit versions such as 1.12.0-SNAPSHOT to be consistent with the rest of the codebase. The release process updates these automatically, and property references such as `${project.version}` have historically caused build/release issues. Explicit values work well and are handled by the release plugin.


[GitHub] [nifi] andrewglowacki commented on a change in pull request #4184: NIFI-7320 Adding HDFS Content Repository implementation.

Posted by GitBox <gi...@apache.org>.
andrewglowacki commented on a change in pull request #4184: NIFI-7320 Adding HDFS Content Repository implementation.
URL: https://github.com/apache/nifi/pull/4184#discussion_r403773990
 
 

 ##########
 File path: nifi-docs/src/main/asciidoc/administration-guide.adoc
 ##########
 @@ -2743,6 +2743,64 @@ nifi.content.repository.encryption.key.id=Key1
 nifi.content.repository.encryption.key=0123456789ABCDEFFEDCBA98765432100123456789ABCDEFFEDCBA9876543210
 ....
 
+
+=== HDFS Content Repository Properties
+
+This content repository uses the Hadoop FileSystem API to store FlowFile content. Because of this, it can be used to store content on the local disk and/or in one or more distinct HDFS clusters. It also has four different operating modes which are described below in the `nifi.content.repository.hdfs.operating.mode` property.
 
 Review comment:
   No; however, I'm not really doing anything crazy with the Hadoop API. Unfortunately, I don't have access to a test cluster that would be able to exercise this well.


[GitHub] [nifi] andrewglowacki commented on a change in pull request #4184: NIFI-7320 Adding HDFS Content Repository implementation.

Posted by GitBox <gi...@apache.org>.
andrewglowacki commented on a change in pull request #4184: NIFI-7320 Adding HDFS Content Repository implementation.
URL: https://github.com/apache/nifi/pull/4184#discussion_r403774121
 
 

 ##########
 File path: nifi-assembly/pom.xml
 ##########
 @@ -192,6 +192,12 @@ language governing permissions and limitations under the License. -->
             <artifactId>nifi-hadoop-nar</artifactId>
             <version>1.12.0-SNAPSHOT</version>
             <type>nar</type>
+    	</dependency>
+	<dependency>
+            <groupId>org.apache.nifi</groupId>
+            <artifactId>nifi-hdfs-repository-nar</artifactId>
 
 Review comment:
   Sorry, our comments overlapped. It doesn't depend on the hadoop-libraries-nar because it has to depend on the nifi-framework-nar in order to work properly with the NAR ClassLoader.


[GitHub] [nifi] andrewglowacki commented on a change in pull request #4184: NIFI-7320 Adding HDFS Content Repository implementation.

Posted by GitBox <gi...@apache.org>.
andrewglowacki commented on a change in pull request #4184: NIFI-7320 Adding HDFS Content Repository implementation.
URL: https://github.com/apache/nifi/pull/4184#discussion_r404128333
 
 

 ##########
 File path: nifi-docs/src/main/asciidoc/administration-guide.adoc
 ##########
 @@ -2743,6 +2743,64 @@ nifi.content.repository.encryption.key.id=Key1
 nifi.content.repository.encryption.key=0123456789ABCDEFFEDCBA98765432100123456789ABCDEFFEDCBA9876543210
 ....
 
+
+=== HDFS Content Repository Properties
 
 Review comment:
   Moving this documentation to a README.txt in the bundle home directory.


[GitHub] [nifi] joewitt commented on a change in pull request #4184: NIFI-7320 Adding HDFS Content Repository implementation.

Posted by GitBox <gi...@apache.org>.
joewitt commented on a change in pull request #4184: NIFI-7320 Adding HDFS Content Repository implementation.
URL: https://github.com/apache/nifi/pull/4184#discussion_r403771554
 
 

 ##########
 File path: nifi-nar-bundles/nifi-hdfs-repository-bundle/nifi-hdfs-content-repository/pom.xml
 ##########
 @@ -0,0 +1,104 @@
+<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 https://maven.apache.org/xsd/maven-4.0.0.xsd">
+    <!--
+      Licensed to the Apache Software Foundation (ASF) under one or more
+      contributor license agreements.  See the NOTICE file distributed with
+      this work for additional information regarding copyright ownership.
+      The ASF licenses this file to You under the Apache License, Version 2.0
+      (the "License"); you may not use this file except in compliance with
+      the License.  You may obtain a copy of the License at
+          http://www.apache.org/licenses/LICENSE-2.0
+      Unless required by applicable law or agreed to in writing, software
+      distributed under the License is distributed on an "AS IS" BASIS,
+      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+      See the License for the specific language governing permissions and
+      limitations under the License.
+    -->
+    <modelVersion>4.0.0</modelVersion>
+    <parent>
+        <groupId>org.apache.nifi</groupId>
+        <artifactId>nifi-hdfs-repository-bundle</artifactId>
+        <version>1.12.0-SNAPSHOT</version>
+    </parent>
+    <artifactId>nifi-hdfs-content-repository</artifactId>
+    <packaging>jar</packaging>
+
+    <dependencies>
+        <dependency>
+            <groupId>org.apache.nifi</groupId>
+            <artifactId>nifi-framework-api</artifactId>
+            <version>${project.version}</version>
+        </dependency>
+        <dependency>
+            <groupId>org.apache.nifi</groupId>
+            <artifactId>nifi-framework-core-api</artifactId>
+            <version>${project.version}</version>
+        </dependency>
+        <dependency>
+            <groupId>org.apache.nifi</groupId>
+            <artifactId>nifi-repository-models</artifactId>
+            <version>${project.version}</version>
+            <scope>provided</scope>
+        </dependency>
+        <dependency>
+            <groupId>org.apache.hadoop</groupId>
+            <artifactId>hadoop-common</artifactId>
+            <version>${hadoop.version}</version>
+            <exclusions>
+                <exclusion>
+                    <groupId>org.slf4j</groupId>
+                    <artifactId>slf4j-log4j12</artifactId>
+                </exclusion>
+            </exclusions>
+        </dependency>
+        <dependency>
+            <groupId>org.apache.hadoop</groupId>
+            <artifactId>hadoop-hdfs</artifactId>
+            <version>${hadoop.version}</version>
+        </dependency>
+        <dependency>
+            <groupId>org.apache.hadoop</groupId>
+            <artifactId>hadoop-hdfs-client</artifactId>
+            <version>${hadoop.version}</version>
+        </dependency>
+        <dependency>
+            <groupId>commons-io</groupId>
+            <artifactId>commons-io</artifactId>
+            <version>2.6</version>
 
 Review comment:
   Are commons-io and lang 2.6 used because that is what the HDFS libs need/want?


[GitHub] [nifi] dtrodrigues commented on a change in pull request #4184: NIFI-7320 Adding HDFS Content Repository implementation.

Posted by GitBox <gi...@apache.org>.
dtrodrigues commented on a change in pull request #4184: NIFI-7320 Adding HDFS Content Repository implementation.
URL: https://github.com/apache/nifi/pull/4184#discussion_r403785196
 
 

 ##########
 File path: nifi-docs/src/main/asciidoc/administration-guide.adoc
 ##########
 @@ -2743,6 +2743,64 @@ nifi.content.repository.encryption.key.id=Key1
 nifi.content.repository.encryption.key=0123456789ABCDEFFEDCBA98765432100123456789ABCDEFFEDCBA9876543210
 ....
 
+
+=== HDFS Content Repository Properties
+
+This content repository uses the Hadoop FileSystem API to store FlowFile content. Because of this, it can be used to store content on the local disk and/or in one or more distinct HDFS clusters. It also has four different operating modes which are described below in the `nifi.content.repository.hdfs.operating.mode` property.
+
+All of the properties defined above (see <<file-system-content-repository-properties,File System Content Repository Properties>>) still apply. Only HDFS-specific properties are listed here. 
+
+The equivalent default local content repository directory would be specified with:
+`nifi.content.repository.directory.default=file:content_repository`
+
+An HDFS content repository directory would be specified with:
+`nifi.content.repository.directory.default=hdfs://localhost:9000/content_repository`
 
 Review comment:
   This URI scheme doesn't work with the current path parsing logic. Gives: 
   `    java.lang.IllegalArgumentException: Pathname /localhost:9000/content_repository from hdfs:/localhost:8020/content_repository is not a valid DFS filename.`
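
The exception above reports the path `/localhost:9000/content_repository` and a URI that starts with `hdfs:/`, which is what one would expect if the `//` after the scheme were collapsed to a single `/` somewhere before the string reaches Hadoop. Where that happens in the PR is not shown in this thread. The sketch below only illustrates the effect using the JDK's `java.net.URI`; the class name `UriAuthorityDemo` and the helper `describe` are hypothetical and are not part of the PR.

```java
import java.net.URI;

/** Shows how java.net.URI splits a location string into authority and path. */
final class UriAuthorityDemo {

    /** Returns "authority|path" for the given location string. */
    static String describe(String location) {
        URI uri = URI.create(location);
        return uri.getAuthority() + "|" + uri.getPath();
    }

    public static void main(String[] args) {
        // Double slash: host and port are parsed as the authority.
        // prints "localhost:9000|/content_repository"
        System.out.println(describe("hdfs://localhost:9000/content_repository"));

        // Single slash: there is no authority, so host and port end up in the path.
        // prints "null|/localhost:9000/content_repository"
        System.out.println(describe("hdfs:/localhost:9000/content_repository"));
    }
}
```

With a single slash the host and port are treated as the first path segment, which matches the pathname quoted in the exception message.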


[GitHub] [nifi] joewitt commented on a change in pull request #4184: NIFI-7320 Adding HDFS Content Repository implementation.

Posted by GitBox <gi...@apache.org>.
joewitt commented on a change in pull request #4184: NIFI-7320 Adding HDFS Content Repository implementation.
URL: https://github.com/apache/nifi/pull/4184#discussion_r403771278
 
 

 ##########
 File path: nifi-docs/src/main/asciidoc/administration-guide.adoc
 ##########
 @@ -2743,6 +2743,64 @@ nifi.content.repository.encryption.key.id=Key1
 nifi.content.repository.encryption.key=0123456789ABCDEFFEDCBA98765432100123456789ABCDEFFEDCBA9876543210
 ....
 
+
+=== HDFS Content Repository Properties
+
 
 Review comment:
   What are the security considerations of this new capability? Does it support Kerberos? Does it support TDE, etc.?


[GitHub] [nifi] dtrodrigues commented on a change in pull request #4184: NIFI-7320 Adding HDFS Content Repository implementation.

Posted by GitBox <gi...@apache.org>.
dtrodrigues commented on a change in pull request #4184: NIFI-7320 Adding HDFS Content Repository implementation.
URL: https://github.com/apache/nifi/pull/4184#discussion_r403785196
 
 

 ##########
 File path: nifi-docs/src/main/asciidoc/administration-guide.adoc
 ##########
 @@ -2743,6 +2743,64 @@ nifi.content.repository.encryption.key.id=Key1
 nifi.content.repository.encryption.key=0123456789ABCDEFFEDCBA98765432100123456789ABCDEFFEDCBA9876543210
 ....
 
+
+=== HDFS Content Repository Properties
+
+This content repository uses the Hadoop FileSystem API to store FlowFile content. Because of this, it can be used to store content on the local disk and/or in one or more distinct HDFS clusters. It also has four different operating modes which are described below in the `nifi.content.repository.hdfs.operating.mode` property.
+
+All of the properties defined above (see <<file-system-content-repository-properties,File System Content Repository Properties>>) still apply. Only HDFS-specific properties are listed here. 
+
+The equivalent default local content repository directory would be specified with:
+`nifi.content.repository.directory.default=file:content_repository`
+
+An HDFS content repository directory would be specified with:
+`nifi.content.repository.directory.default=hdfs://localhost:9000/content_repository`
 
 Review comment:
   This URI scheme doesn't work with the current path parsing logic.


[GitHub] [nifi] andrewglowacki commented on a change in pull request #4184: NIFI-7320 Adding HDFS Content Repository implementation.

Posted by GitBox <gi...@apache.org>.
andrewglowacki commented on a change in pull request #4184: NIFI-7320 Adding HDFS Content Repository implementation.
URL: https://github.com/apache/nifi/pull/4184#discussion_r403772200
 
 

 ##########
 File path: nifi-docs/src/main/asciidoc/administration-guide.adoc
 ##########
 @@ -2743,6 +2743,64 @@ nifi.content.repository.encryption.key.id=Key1
 nifi.content.repository.encryption.key=0123456789ABCDEFFEDCBA98765432100123456789ABCDEFFEDCBA9876543210
 ....
 
+
+=== HDFS Content Repository Properties
+
+This content repository uses the Hadoop FileSystem API to store FlowFile content. Because of this, it can be used to store content on the local disk and/or in one or more distinct HDFS clusters. It also has four different operating modes which are described below in the `nifi.content.repository.hdfs.operating.mode` property.
+
+All of the properties defined above (see <<file-system-content-repository-properties,File System Content Repository Properties>>) still apply. Only HDFS-specific properties are listed here. 
+
+The equivalent default local content repository directory would be specified with:
+`nifi.content.repository.directory.default=file:content_repository`
+
+An HDFS content repository directory would be specified with:
+`nifi.content.repository.directory.default=hdfs://localhost:9000/content_repository`
+
+Example Minimal Configuration:
+`nifi.content.repository.implementation=org.apache.nifi.hdfs.repository.HdfsContentRepository`
+`nifi.content.repository.hdfs.core.site=./conf/core-site.xml`
+
+|====
+|*Property*|*Description*
+|`nifi.content.repository.implementation`|Value should be: `org.apache.nifi.hdfs.repository.HdfsContentRepository`
+|`nifi.content.repository.hdfs.core.site`|The default Hadoop `core-site.xml` file to configure file systems with. +
+
+	*NOTE:* This isn't actually required as long as each location specifies its own `core-site.xml`; however, each directory is required to have a `core-site.xml` defined either with this property or as described below.
+
+	For example:
+	Assume the following two locations: +
+	`nifi.content.repository.directory.default1=uri://path/to/dir1` +
+	`nifi.content.repository.directory.default2=uri://path/to/dir2` +
+
+	Then the following two properties may also be provided: +
+	`nifi.content.repository.hdfs.core.site.default1=/path/to/core-site-1.xml` +
+	`nifi.content.repository.hdfs.core.site.default2=/path/to/core-site-2.xml`
+|`nifi.content.repository.hdfs.primary`|A comma separated list of location names to treat as the primary storage group for the `CapacityFallback` and `FailureFallback` operating modes. +
+
+Example: +
+`nifi.content.repository.hdfs.primary=disk1,disk2,disk3`
+|`nifi.content.repository.hdfs.archive`|A comma separated list of location names to store archived content in. See the `Archive` operating mode below. +
+
+Example: +
+`nifi.content.repository.hdfs.archive=archive1,archive2,archive3`
+|`nifi.content.repository.hdfs.operating.mode`|A comma separated list of operating modes that governs the behavior of the content repository. Default is `Normal`. +
+
+	The recognized modes and their behaviors are as follows: +
+
+	`Normal`: No special fallback handling is made during failure. Each configured location is written to as normal until it is full. Once all locations are full, writes will block until space becomes available. Note: This is the default operating mode if one isn't specified in the `nifi.properties` file. +
+
+	`CapacityFallback`: The locations in the 'primary' group are filled first and the rest are only filled once all locations in the primary group are full. Once space becomes available again for at least a minute, the primary group will become active again. This mode cannot be used with the `FailureFallback` mode. The 'primary' group is specified with the following property where each location id is comma separated: `nifi.content.repository.hdfs.primary` +
+
+	`FailureFallback`: The locations in the 'primary' group are filled as normal until they are full. Once they are full, writes will block until space becomes available. If a write failure occurs within all primary locations, the remaining non-primary locations are written to until a configured time period has elapsed. This mode cannot be used with the `CapacityFallback` mode. +
+
+	`Archive`:  All locations are written to and filled as described in the other modes. As files are moved to the archive, they are copied to the locations in the 'archive' group and then deleted. This can be combined with any of the other three modes. If this is the only mode specified, `Normal` is also assumed. The 'archive' group is specified with the  `nifi.content.repository.hdfs.archive` property where each location id is comma separated.
+|`nifi.content.repository.hdfs.full.percentage`|The percentage ('##%') of a location's capacity that must be occupied before treating the location as 'full'. Note: Once a location is full, all writes will stop for that location. If all locations are full and there is no fallback, claim creation will stop until space becomes available. The default is `95%`.
+|`nifi.content.repository.hdfs.failure.timeout`|The amount of time to wait when a failure occurs for a location before attempting to use that location again for writing. Example value: `1 minute`
+|`nifi.content.repository.hdfs.wait.active.containers.timeout`|The amount of time to wait for an active location to be available before giving up and throwing an exception. Defaults to indefinite. Example value: `5 minutes`
+|`nifi.content.repository.hdfs.sections.per.container`|The number of subdirectories per location. Defaults to `1024`. This is primarily used to avoid too many content claim files within a single directory.
 
 Review comment:
   Which property are you referring to? The claims operate the same way the default FileSystemRepository works: each physical resource claim file on disk will contain one or more content claims.
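
The operating-mode rules in the documentation quoted in this diff (a comma separated list, `Normal` as the default, `CapacityFallback` and `FailureFallback` being mutually exclusive, and `Archive` on its own implying `Normal`) can be sketched as a small validation routine. This is only an illustration of the documented rules, not the PR's code: the class `OperatingModes`, its `parse` method, and the choice to match mode names case-sensitively are all assumptions.

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical parser for nifi.content.repository.hdfs.operating.mode. */
final class OperatingModes {

    /** The four modes named in the documentation under review. */
    enum Mode { Normal, CapacityFallback, FailureFallback, Archive }

    /** Parses a comma separated mode list and applies the documented rules. */
    static Set<Mode> parse(String value) {
        EnumSet<Mode> modes = EnumSet.noneOf(Mode.class);
        if (value != null) {
            for (String token : value.split(",")) {
                String name = token.trim();
                if (!name.isEmpty()) {
                    // Assumption: names are case-sensitive; unknown names
                    // raise IllegalArgumentException via Enum.valueOf.
                    modes.add(Mode.valueOf(name));
                }
            }
        }
        // Documented rule: the two fallback modes cannot be used together.
        if (modes.contains(Mode.CapacityFallback) && modes.contains(Mode.FailureFallback)) {
            throw new IllegalArgumentException(
                    "CapacityFallback and FailureFallback cannot be combined");
        }
        // Documented rules: Normal is the default, and Archive alone implies Normal.
        if (modes.isEmpty() || modes.equals(EnumSet.of(Mode.Archive))) {
            modes.add(Mode.Normal);
        }
        return modes;
    }

    public static void main(String[] args) {
        System.out.println(parse(null));                       // prints "[Normal]"
        System.out.println(parse("Archive"));                  // prints "[Normal, Archive]"
        System.out.println(parse("CapacityFallback, Archive")); // prints "[CapacityFallback, Archive]"
    }
}
```

The documentation does not say whether `Normal` may be listed together with a fallback mode, so the sketch accepts that combination without comment.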


[GitHub] [nifi] andrewglowacki commented on a change in pull request #4184: NIFI-7320 Adding HDFS Content Repository implementation.

Posted by GitBox <gi...@apache.org>.
andrewglowacki commented on a change in pull request #4184: NIFI-7320 Adding HDFS Content Repository implementation.
URL: https://github.com/apache/nifi/pull/4184#discussion_r403773557
 
 

 ##########
 File path: nifi-nar-bundles/nifi-hdfs-repository-bundle/nifi-hdfs-content-repository/pom.xml
 ##########
 @@ -0,0 +1,104 @@
+<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 https://maven.apache.org/xsd/maven-4.0.0.xsd">
+    <!--
+      Licensed to the Apache Software Foundation (ASF) under one or more
+      contributor license agreements.  See the NOTICE file distributed with
+      this work for additional information regarding copyright ownership.
+      The ASF licenses this file to You under the Apache License, Version 2.0
+      (the "License"); you may not use this file except in compliance with
+      the License.  You may obtain a copy of the License at
+          http://www.apache.org/licenses/LICENSE-2.0
+      Unless required by applicable law or agreed to in writing, software
+      distributed under the License is distributed on an "AS IS" BASIS,
+      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+      See the License for the specific language governing permissions and
+      limitations under the License.
+    -->
+    <modelVersion>4.0.0</modelVersion>
+    <parent>
+        <groupId>org.apache.nifi</groupId>
+        <artifactId>nifi-hdfs-repository-bundle</artifactId>
+        <version>1.12.0-SNAPSHOT</version>
+    </parent>
+    <artifactId>nifi-hdfs-content-repository</artifactId>
+    <packaging>jar</packaging>
+
+    <dependencies>
+        <dependency>
+            <groupId>org.apache.nifi</groupId>
+            <artifactId>nifi-framework-api</artifactId>
+            <version>${project.version}</version>
 
 Review comment:
   Will do.


[GitHub] [nifi] andrewglowacki commented on a change in pull request #4184: NIFI-7320 Adding HDFS Content Repository implementation.

Posted by GitBox <gi...@apache.org>.
andrewglowacki commented on a change in pull request #4184: NIFI-7320 Adding HDFS Content Repository implementation.
URL: https://github.com/apache/nifi/pull/4184#discussion_r404126624
 
 

 ##########
 File path: nifi-assembly/pom.xml
 ##########
 @@ -192,6 +192,12 @@ language governing permissions and limitations under the License. -->
             <artifactId>nifi-hadoop-nar</artifactId>
             <version>1.12.0-SNAPSHOT</version>
             <type>nar</type>
+    	</dependency>
+	<dependency>
+            <groupId>org.apache.nifi</groupId>
+            <artifactId>nifi-hdfs-repository-nar</artifactId>
 
 Review comment:
   The NAR is pretty big compared to most others: 53 MB, where the average size is 23 MB. I'll add this to a new, separate profile.


[GitHub] [nifi] andrewglowacki commented on a change in pull request #4184: NIFI-7320 Adding HDFS Content Repository implementation.

Posted by GitBox <gi...@apache.org>.
andrewglowacki commented on a change in pull request #4184: NIFI-7320 Adding HDFS Content Repository implementation.
URL: https://github.com/apache/nifi/pull/4184#discussion_r403773147
 
 

 ##########
 File path: nifi-docs/src/main/asciidoc/administration-guide.adoc
 ##########
 @@ -2743,6 +2743,64 @@ nifi.content.repository.encryption.key.id=Key1
 nifi.content.repository.encryption.key=0123456789ABCDEFFEDCBA98765432100123456789ABCDEFFEDCBA9876543210
 ....
 
+
+=== HDFS Content Repository Properties
+
 
 Review comment:
   It is using the standard HDFS Hadoop Client API, so yes it would. I actually haven't used it before, but it appears to be configured through the Hadoop core-site.xml, which is the config file the repository (and other HDFS processors) requires.
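
Since security settings would come from `core-site.xml`, it matters which file applies to which location. The documentation under review in this thread describes a default `nifi.content.repository.hdfs.core.site` property plus an optional per-location form with the location name appended. A minimal sketch of that lookup, assuming plain `java.util.Properties` and a hypothetical `CoreSiteResolver` class (neither is taken from the PR), might look like this:

```java
import java.util.Properties;

/** Hypothetical lookup of the core-site.xml path for a named location. */
final class CoreSiteResolver {

    static final String CORE_SITE = "nifi.content.repository.hdfs.core.site";

    /** Prefers the per-location property, then falls back to the default one. */
    static String coreSiteFor(Properties props, String locationName) {
        String specific = props.getProperty(CORE_SITE + "." + locationName);
        if (specific != null && !specific.trim().isEmpty()) {
            return specific.trim();
        }
        String fallback = props.getProperty(CORE_SITE);
        if (fallback != null && !fallback.trim().isEmpty()) {
            return fallback.trim();
        }
        // The documentation states every directory must have a core-site.xml.
        throw new IllegalStateException(
                "No core-site.xml configured for location '" + locationName + "'");
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty(CORE_SITE, "./conf/core-site.xml");
        props.setProperty(CORE_SITE + ".default2", "/path/to/core-site-2.xml");

        System.out.println(coreSiteFor(props, "default1")); // prints "./conf/core-site.xml"
        System.out.println(coreSiteFor(props, "default2")); // prints "/path/to/core-site-2.xml"
    }
}
```

Under this reading, a Kerberized cluster and an unsecured one could each be given their own `core-site.xml` by location name, though that has not been tested in this thread.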


[GitHub] [nifi] andrewglowacki commented on a change in pull request #4184: NIFI-7320 Adding HDFS Content Repository implementation.

Posted by GitBox <gi...@apache.org>.
andrewglowacki commented on a change in pull request #4184: NIFI-7320 Adding HDFS Content Repository implementation.
URL: https://github.com/apache/nifi/pull/4184#discussion_r403772266
 
 

 ##########
 File path: nifi-docs/src/main/asciidoc/administration-guide.adoc
 ##########
 @@ -2743,6 +2743,64 @@ nifi.content.repository.encryption.key.id=Key1
 nifi.content.repository.encryption.key=0123456789ABCDEFFEDCBA98765432100123456789ABCDEFFEDCBA9876543210
 ....
 
+
+=== HDFS Content Repository Properties
 
 Review comment:
   Sounds good. Are there examples of this anywhere you can point me to?


[GitHub] [nifi] andrewglowacki commented on a change in pull request #4184: NIFI-7320 Adding HDFS Content Repository implementation.

Posted by GitBox <gi...@apache.org>.
andrewglowacki commented on a change in pull request #4184: NIFI-7320 Adding HDFS Content Repository implementation.
URL: https://github.com/apache/nifi/pull/4184#discussion_r403773538
 
 

 ##########
 File path: nifi-nar-bundles/nifi-hdfs-repository-bundle/nifi-hdfs-content-repository/pom.xml
 ##########
 @@ -0,0 +1,104 @@
+<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 https://maven.apache.org/xsd/maven-4.0.0.xsd">
+    <!--
+      Licensed to the Apache Software Foundation (ASF) under one or more
+      contributor license agreements.  See the NOTICE file distributed with
+      this work for additional information regarding copyright ownership.
+      The ASF licenses this file to You under the Apache License, Version 2.0
+      (the "License"); you may not use this file except in compliance with
+      the License.  You may obtain a copy of the License at
+          http://www.apache.org/licenses/LICENSE-2.0
+      Unless required by applicable law or agreed to in writing, software
+      distributed under the License is distributed on an "AS IS" BASIS,
+      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+      See the License for the specific language governing permissions and
+      limitations under the License.
+    -->
+    <modelVersion>4.0.0</modelVersion>
+    <parent>
+        <groupId>org.apache.nifi</groupId>
+        <artifactId>nifi-hdfs-repository-bundle</artifactId>
+        <version>1.12.0-SNAPSHOT</version>
+    </parent>
+    <artifactId>nifi-hdfs-content-repository</artifactId>
+    <packaging>jar</packaging>
+
+    <dependencies>
+        <dependency>
+            <groupId>org.apache.nifi</groupId>
+            <artifactId>nifi-framework-api</artifactId>
+            <version>${project.version}</version>
+        </dependency>
+        <dependency>
+            <groupId>org.apache.nifi</groupId>
+            <artifactId>nifi-framework-core-api</artifactId>
+            <version>${project.version}</version>
+        </dependency>
+        <dependency>
+            <groupId>org.apache.nifi</groupId>
+            <artifactId>nifi-repository-models</artifactId>
+            <version>${project.version}</version>
+            <scope>provided</scope>
+        </dependency>
+        <dependency>
+            <groupId>org.apache.hadoop</groupId>
+            <artifactId>hadoop-common</artifactId>
+            <version>${hadoop.version}</version>
+            <exclusions>
+                <exclusion>
+                    <groupId>org.slf4j</groupId>
+                    <artifactId>slf4j-log4j12</artifactId>
+                </exclusion>
+            </exclusions>
+        </dependency>
+        <dependency>
+            <groupId>org.apache.hadoop</groupId>
+            <artifactId>hadoop-hdfs</artifactId>
+            <version>${hadoop.version}</version>
+        </dependency>
+        <dependency>
+            <groupId>org.apache.hadoop</groupId>
+            <artifactId>hadoop-hdfs-client</artifactId>
+            <version>${hadoop.version}</version>
+        </dependency>
+        <dependency>
+            <groupId>commons-io</groupId>
+            <artifactId>commons-io</artifactId>
+            <version>2.6</version>
 
 Review comment:
   I use commons-lang and commons-io briefly in a couple of places. I'm not sure if Hadoop needs them or not. They are included in the nifi-hdfs-processors bundle, which I modelled the pom on, and I just kept them in there.


[GitHub] [nifi] joewitt commented on a change in pull request #4184: NIFI-7320 Adding HDFS Content Repository implementation.

Posted by GitBox <gi...@apache.org>.
joewitt commented on a change in pull request #4184: NIFI-7320 Adding HDFS Content Repository implementation.
URL: https://github.com/apache/nifi/pull/4184#discussion_r403771119
 
 

 ##########
 File path: nifi-docs/src/main/asciidoc/administration-guide.adoc
 ##########
 @@ -2743,6 +2743,64 @@ nifi.content.repository.encryption.key.id=Key1
 nifi.content.repository.encryption.key=0123456789ABCDEFFEDCBA98765432100123456789ABCDEFFEDCBA9876543210
 ....
 
+
+=== HDFS Content Repository Properties
+
+This content repository uses the Hadoop FileSystem API to store FlowFile content. Because of this, it can be used to store content on the local disk and/or in one or more distinct HDFS clusters. It also has four different operating modes which are described below in the `nifi.content.repository.hdfs.operating.mode` property.
+
+All of the properties defined above (see <<file-system-content-repository-properties,File System Content Repository Properties>>) still apply. Only HDFS-specific properties are listed here. 
+
+The equivalent default local content repository directory would be specified with:
+`nifi.content.repository.directory.default=file:content_repository`
+
+An HDFS content repository directory would be specified with:
+`nifi.content.repository.directory.default=hdfs://localhost:9000/content_repository`
+
+Example Minimal Configuration:
+`nifi.content.repository.implementation=org.apache.nifi.hdfs.repository.HdfsContentRepository`
+`nifi.content.repository.hdfs.core.site=./conf/core-site.xml`
+
+|====
+|*Property*|*Description*
+|`nifi.content.repository.implementation`|Value should be: `org.apache.nifi.hdfs.repository.HdfsContentRepository`
+|`nifi.content.repository.hdfs.core.site`|The default Hadoop `core-site.xml` file to configure file systems with. +
+
+	*NOTE:* This isn't actually required as long as each location specifies its own core.site.xml, however each directory is required to have a `core-site.xml` defined either with this property, or as described below.
+
+	For example:
+	Assume the following two locations: +
+	`nifi.content.repository.directory.default1=uri://path/to/dir1` +
+	`nifi.content.repository.directory.default2=uri://path/to/dir2` +
+
+	Then the following two properties may also be provided: +
+	`nifi.content.repository.hdfs.core.site.default1=/path/to/core-site-1.xml` +
+	`nifi.content.repository.hdfs.core.site.default2=/path/to/core-site-2.xml`
+|`nifi.content.repository.hdfs.primary`|A comma separated list of location names to treat as the primary storage group for the `CapacityFallback` and `FailureFallback` operating modes. +
+
+Example: +
+`nifi.content.repository.hdfs.primary=disk1,disk2,disk3`
+|`nifi.content.repository.hdfs.archive`|A comma separated list of location names to store archived content in. See the `Archive` operating mode above. +
+
+Example: +
+`nifi.content.repository.hdfs.archive=archive1,archive2,archive3`
+|`nifi.content.repository.hdfs.operating.mode`|A comma separated list of operating modes that governs the behavior of the content repository. Default is `Normal`. +
+
+	The recognized modes and their behaviors are as follows: +
+
+	`Normal`: No special fallback handling is made during failure. Each configured location is written to as normal until they are full. Once all locations are full, writes will block until space becomes available. Note: This is default operating mode if one isn't specified in the `nifi.properties` file. +
+
+	`CapacityFallback`: The locations in the 'primary' group are filled first and the rest are only filled once all locations in the primary group are full. Once space becomes available again for at least a minute, the primary group will become active again. This mode cannot be used with the `FailureFallback` mode. The 'primary' group is specified with the following property where each location id is comma separated: `nifi.content.repository.hdfs.primary` +
+
+	`FailureFallback`: The configured locations 'primary' group are filled as normal until they are full. Once they are full, writes will block until space becomes available. If a write failure occurs within all primary locations, the remaining non-primary locations are written to until a configured time period has elapsed. This mode cannot be used with the `CapacityFallback` mode. +
+
+	`Archive`:  All locations are written to and filled as described in the other modes. As files are moved to the archive, they are copied to the locations in the 'archive' group and then deleted. This can be combined with any of the other three modes. If this is the only mode specified, `Normal` is also assumed. The 'archive' group is specified with the  `nifi.content.repository.hdfs.archive` property where each location id is comma separated.
+|`nifi.content.repository.hdfs.full.percentage`|The percentage ('##%') of a location's capacity that must be occupied before treating the location as 'full'. Note: Once a location is full, all writes will stop for that location. If all locations are full and there is no fallback, claim creation will stop until space becomes available. The default is `95%`.
+|`nifi.content.repository.hdfs.failure.timeout`|The amount of time to wait when a failure occurs for a location before attempting to use that location again for writing. Example value: `1 minute`
+|`nifi.content.repository.hdfs.wait.active.containers.timeout`|The amount of time to wait for an active location to be available before giving up and throwing an exception. Defaults to indefinite. Example value: `5 minutes`
+|`nifi.content.repository.hdfs.sections.per.container`|The number of subdirectories per location. Defaults to `1024`. This is primarily used to avoid too many content claim files within a single directory.
 
 Review comment:
   Claims go to single files (not directories), unless that is done differently here.
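
   To illustrate the point about `nifi.content.repository.hdfs.sections.per.container`, here is a minimal sketch of how a claim file might be spread across section subdirectories while each claim still lands in a single file. The class name `SectionLayout` and the modulo scheme are assumptions made for illustration; this thread does not show how the PR actually assigns claims to sections.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper: maps a claim index onto one of N section subdirectories. */
final class SectionLayout {
    private final int sectionsPerContainer;

    SectionLayout(int sectionsPerContainer) {
        if (sectionsPerContainer <= 0) {
            throw new IllegalArgumentException("sectionsPerContainer must be positive");
        }
        this.sectionsPerContainer = sectionsPerContainer;
    }

    /** Section index for a claim: the claim index modulo the section count (assumed scheme). */
    int sectionFor(long claimIndex) {
        return (int) Math.floorMod(claimIndex, (long) sectionsPerContainer);
    }

    /** Relative path of the single file holding the claim: <section>/<claimId>. */
    Path claimPath(long claimIndex, String claimId) {
        return Paths.get(Integer.toString(sectionFor(claimIndex)), claimId);
    }
}
```

   With the documented default of 1024 sections, the directories only bound how many claim files share a parent directory; each claim remains one file.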


[GitHub] [nifi] andrewglowacki commented on a change in pull request #4184: NIFI-7320 Adding HDFS Content Repository implementation.

Posted by GitBox <gi...@apache.org>.
andrewglowacki commented on a change in pull request #4184: NIFI-7320 Adding HDFS Content Repository implementation.
URL: https://github.com/apache/nifi/pull/4184#discussion_r403771894
 
 

 ##########
 File path: nifi-assembly/pom.xml
 ##########
 @@ -192,6 +192,12 @@ language governing permissions and limitations under the License. -->
             <artifactId>nifi-hadoop-nar</artifactId>
             <version>1.12.0-SNAPSHOT</version>
             <type>nar</type>
+    	</dependency>
+	<dependency>
+            <groupId>org.apache.nifi</groupId>
+            <artifactId>nifi-hdfs-repository-nar</artifactId>
 
 Review comment:
   Fair enough - should I put it in a different profile, or just completely remove it?


[GitHub] [nifi] joewitt commented on a change in pull request #4184: NIFI-7320 Adding HDFS Content Repository implementation.

Posted by GitBox <gi...@apache.org>.
joewitt commented on a change in pull request #4184: NIFI-7320 Adding HDFS Content Repository implementation.
URL: https://github.com/apache/nifi/pull/4184#discussion_r403772835
 
 

 ##########
 File path: nifi-docs/src/main/asciidoc/administration-guide.adoc
 ##########
 @@ -2743,6 +2743,64 @@ nifi.content.repository.encryption.key.id=Key1
 nifi.content.repository.encryption.key=0123456789ABCDEFFEDCBA98765432100123456789ABCDEFFEDCBA9876543210
 ....
 
+
+=== HDFS Content Repository Properties
 
 Review comment:
   We don't have examples of external things I can point you to. We have included things and been clear they're experimental. This means they might not work. We don't know how they behave. And we might not even keep them in subsequent releases. If this new NAR is really small we can probably include it. I've not built this myself to check.


[GitHub] [nifi] joewitt commented on a change in pull request #4184: NIFI-7320 Adding HDFS Content Repository implementation.

Posted by GitBox <gi...@apache.org>.
joewitt commented on a change in pull request #4184: NIFI-7320 Adding HDFS Content Repository implementation.
URL: https://github.com/apache/nifi/pull/4184#discussion_r403771033
 
 

 ##########
 File path: nifi-assembly/pom.xml
 ##########
 @@ -192,6 +192,12 @@ language governing permissions and limitations under the License. -->
             <artifactId>nifi-hadoop-nar</artifactId>
             <version>1.12.0-SNAPSHOT</version>
             <type>nar</type>
+    	</dependency>
+	<dependency>
+            <groupId>org.apache.nifi</groupId>
+            <artifactId>nifi-hdfs-repository-nar</artifactId>
 
 Review comment:
   This cannot be in the convenience binary at this point.  We cannot afford to add nars unless their size is trivial (we're already over capacity in terms of build size).  Instead there should be instructions for users that want to experiment with this on how to add it to their nifi installs. 


[GitHub] [nifi] joewitt commented on issue #4184: NIFI-7320 Adding HDFS Content Repository implementation.

Posted by GitBox <gi...@apache.org>.
joewitt commented on issue #4184: NIFI-7320 Adding HDFS Content Repository implementation.
URL: https://github.com/apache/nifi/pull/4184#issuecomment-609882719
 
 
   I advise the following path:
   
   1. Create a separate github repo for this nar and provide instructions on how someone could place this into their nifi installation and instructions on how to get started.
   2. The Kerberos stuff is a very big deal for this to be used for anything production related.  This means considerations on how to configure, ensure renewals, etc. all matter.  A *lot* of effort has gone into the various Hadoop related processors for this.
   3. Conduct significant tests on real HDFS clusters to provide information on performance observed and use cases utilized.  There are significant reasons why we've not pursued this previously. While there could be interesting results, they need to exist before we're talking about merging this into the codebase.
   
   Clearly considerable effort has gone into this PR.  But a lot more will be required both by you as a contributor and the community as reviewers. The above path allows that to happen but it will take time.
   


[GitHub] [nifi] andrewglowacki closed pull request #4184: NIFI-7320 Adding HDFS Content Repository implementation.

Posted by GitBox <gi...@apache.org>.
andrewglowacki closed pull request #4184: NIFI-7320 Adding HDFS Content Repository implementation.
URL: https://github.com/apache/nifi/pull/4184
 
 
   


[GitHub] [nifi] joewitt commented on a change in pull request #4184: NIFI-7320 Adding HDFS Content Repository implementation.

Posted by GitBox <gi...@apache.org>.
joewitt commented on a change in pull request #4184: NIFI-7320 Adding HDFS Content Repository implementation.
URL: https://github.com/apache/nifi/pull/4184#discussion_r403771344
 
 

 ##########
 File path: nifi-docs/src/main/asciidoc/administration-guide.adoc
 ##########
 @@ -2743,6 +2743,64 @@ nifi.content.repository.encryption.key.id=Key1
 nifi.content.repository.encryption.key=0123456789ABCDEFFEDCBA98765432100123456789ABCDEFFEDCBA9876543210
 ....
 
+
+=== HDFS Content Repository Properties
+
+This content repository uses the Hadoop FileSystem API to store FlowFile content. Because of this, it can be used to store content on the local disk and/or in one or more distinct HDFS clusters. It also has four different operating modes which are described below in the `nifi.content.repository.hdfs.operating.mode` property.
 
 Review comment:
   Have any performance tests been done on this and if so for what kind of use cases/what performance?
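
   For readers following the operating modes mentioned in the quoted paragraph, the combination rules in the proposed administration-guide text (default `Normal`, `CapacityFallback` and `FailureFallback` mutually exclusive, `Archive` alone implying `Normal`) can be sketched as below. The names `OperatingMode` and `OperatingModes.parse` are hypothetical and are not taken from the PR; the sketch only encodes the rules stated in the documentation.

```java
import java.util.EnumSet;
import java.util.Set;

/** The four mode names listed in the proposed documentation. */
enum OperatingMode { Normal, CapacityFallback, FailureFallback, Archive }

/** Hypothetical parser for a comma separated operating mode value. */
final class OperatingModes {
    static Set<OperatingMode> parse(String value) {
        EnumSet<OperatingMode> modes = EnumSet.noneOf(OperatingMode.class);
        if (value != null) {
            for (String token : value.split(",")) {
                String trimmed = token.trim();
                if (!trimmed.isEmpty()) {
                    // valueOf throws IllegalArgumentException for unknown names
                    modes.add(OperatingMode.valueOf(trimmed));
                }
            }
        }
        // The docs state the two fallback modes cannot be used together.
        if (modes.contains(OperatingMode.CapacityFallback)
                && modes.contains(OperatingMode.FailureFallback)) {
            throw new IllegalArgumentException(
                    "CapacityFallback and FailureFallback cannot be combined");
        }
        // Normal is the default, and is also assumed when Archive is the only mode.
        boolean archiveOnly = modes.size() == 1 && modes.contains(OperatingMode.Archive);
        if (modes.isEmpty() || archiveOnly) {
            modes.add(OperatingMode.Normal);
        }
        return modes;
    }
}
```

   Other combinations, such as `Normal` together with a fallback mode, are not addressed by the quoted text, so the sketch leaves them unrestricted.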


[GitHub] [nifi] joewitt commented on a change in pull request #4184: NIFI-7320 Adding HDFS Content Repository implementation.

Posted by GitBox <gi...@apache.org>.
joewitt commented on a change in pull request #4184: NIFI-7320 Adding HDFS Content Repository implementation.
URL: https://github.com/apache/nifi/pull/4184#discussion_r403771235
 
 

 ##########
 File path: nifi-docs/src/main/asciidoc/administration-guide.adoc
 ##########
 @@ -2743,6 +2743,64 @@ nifi.content.repository.encryption.key.id=Key1
 nifi.content.repository.encryption.key=0123456789ABCDEFFEDCBA98765432100123456789ABCDEFFEDCBA9876543210
 ....
 
+
+=== HDFS Content Repository Properties
 
 Review comment:
   It should be highlighted that this is a new and experimental implementation.  Also, it should not be included in the default convenience binaries at this point, so these docs should live elsewhere, such as a README.


[GitHub] [nifi] dtrodrigues commented on a change in pull request #4184: NIFI-7320 Adding HDFS Content Repository implementation.

Posted by GitBox <gi...@apache.org>.
dtrodrigues commented on a change in pull request #4184: NIFI-7320 Adding HDFS Content Repository implementation.
URL: https://github.com/apache/nifi/pull/4184#discussion_r403779595
 
 

 ##########
 File path: nifi-nar-bundles/nifi-hdfs-repository-bundle/nifi-hdfs-content-repository/src/main/java/org/apache/nifi/hdfs/repository/ContainerGroup.java
 ##########
 @@ -0,0 +1,297 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nifi.hdfs.repository;
+
+import static org.apache.nifi.hdfs.repository.HdfsContentRepository.CORE_SITE_DEFAULT_PROPERTY;
+
+import java.io.File;
+import java.io.IOException;
+import java.net.URI;
+import java.util.ArrayList;
+import java.util.Collection;
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.Iterator;
+import java.util.List;
+import java.util.Map;
+import java.util.Map.Entry;
+import java.util.concurrent.Callable;
+import java.util.concurrent.ExecutionException;
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.Executors;
+import java.util.concurrent.Future;
+import java.util.concurrent.TimeUnit;
+import java.util.concurrent.TimeoutException;
+import java.util.Set;
+import java.util.TreeMap;
+
+import org.apache.commons.lang.StringUtils;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.FsStatus;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.hdfs.DFSConfigKeys;
+import org.apache.nifi.util.NiFiProperties;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+public class ContainerGroup implements Iterable<Container> {
+    private static final Logger LOG = LoggerFactory.getLogger(HdfsContentRepository.class);
+    private final Map<String, Container> byName;
+    private final List<Container> all;
+    private final int numContainers;
+
+    /**
+     * Creates a new container group based on the specified existing containers.
+     */
+    public ContainerGroup(Collection<Container> containers) {
+        Map<String, Container> byName = new HashMap<>();
+        for (Container container : containers) {
+            byName.put(container.getName(), container);
+        }
+        this.byName = byName;
+        this.all = new ArrayList<>(new TreeMap<String, Container>(byName).values());
+        this.numContainers = this.all.size();
+    }
+
+    /**
+     * Creates a group of containers as they are defined in the nifi properties
+     * whose ids match the ones specified. This also ensures each container is
+     * properly configured and the default directory structure is present.
+     */
+    public ContainerGroup(NiFiProperties properties, RepositoryConfig repoConfig, Set<String> include,
+            Set<String> exclude) {
+        Configuration defaultHdfsConfig = null;
+        if (properties.getProperty(HdfsContentRepository.CORE_SITE_DEFAULT_PROPERTY) != null) {
+            defaultHdfsConfig = new Configuration();
+            defaultHdfsConfig.addResource(new Path(
+                    verifyExists(properties.getProperty(CORE_SITE_DEFAULT_PROPERTY), CORE_SITE_DEFAULT_PROPERTY)));
+        }
+
+        if (include != null) {
+            include = new HashSet<>(include);
+        }
+
+        Map<String, Container> byName = new HashMap<>();
+        for (Entry<String, java.nio.file.Path> entry : properties.getContentRepositoryPaths().entrySet()) {
+            String name = entry.getKey();
+            if (include != null && !include.contains(name)) {
+                continue;
+            } else if (exclude != null && exclude.contains(name)) {
+                continue;
+            }
+
+            Configuration config = defaultHdfsConfig;
+            String coreSitePath = properties.getProperty(CORE_SITE_DEFAULT_PROPERTY + "." + name);
+            if (coreSitePath != null) {
+                config = new Configuration();
+                config.addResource(new Path(verifyExists(coreSitePath, CORE_SITE_DEFAULT_PROPERTY + "." + name)));
+            } else if (defaultHdfsConfig == null) {
+                throw new RuntimeException("No core.site.xml defined for content repository container with name: "
 
 Review comment:
   core-site.xml
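
   The lookup order in the quoted constructor (a per-container `core-site.xml` overrides the default, and a container with neither is an error) can be sketched without Hadoop or NiFi on the classpath. `CoreSiteResolver` is a hypothetical name, `java.util.Properties` stands in for `NiFiProperties`, and the constant's value is assumed to be the property name given in the proposed documentation.

```java
import java.util.Properties;

/** Hypothetical stand-in for the core-site.xml lookup shown in the quoted diff. */
final class CoreSiteResolver {
    // Assumed value; the diff only shows the constant's name.
    static final String CORE_SITE_DEFAULT_PROPERTY = "nifi.content.repository.hdfs.core.site";

    /** Returns the core-site.xml path to use for the named container. */
    static String resolve(Properties properties, String containerName) {
        // A container-specific property takes precedence over the default.
        String specific = properties.getProperty(CORE_SITE_DEFAULT_PROPERTY + "." + containerName);
        if (specific != null) {
            return specific;
        }
        String fallback = properties.getProperty(CORE_SITE_DEFAULT_PROPERTY);
        if (fallback == null) {
            throw new IllegalStateException(
                    "No core-site.xml defined for content repository container with name: "
                            + containerName);
        }
        return fallback;
    }
}
```

   The error message here uses `core-site.xml`, matching the file name the review comment asks for.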


[GitHub] [nifi] dtrodrigues commented on a change in pull request #4184: NIFI-7320 Adding HDFS Content Repository implementation.

Posted by GitBox <gi...@apache.org>.
dtrodrigues commented on a change in pull request #4184: NIFI-7320 Adding HDFS Content Repository implementation.
URL: https://github.com/apache/nifi/pull/4184#discussion_r403779488
 
 

 ##########
 File path: nifi-docs/src/main/asciidoc/administration-guide.adoc
 ##########
 @@ -2743,6 +2743,64 @@ nifi.content.repository.encryption.key.id=Key1
 nifi.content.repository.encryption.key=0123456789ABCDEFFEDCBA98765432100123456789ABCDEFFEDCBA9876543210
 ....
 
+
+=== HDFS Content Repository Properties
+
+This content repository uses the Hadoop FileSystem API to store FlowFile content. Because of this, it can be used to store content on the local disk and/or in one or more distinct HDFS clusters. It also has four different operating modes which are described below in the `nifi.content.repository.hdfs.operating.mode` property.
+
+All of the properties defined above (see <<file-system-content-repository-properties,File System Content Repository Properties>>) still apply. Only HDFS-specific properties are listed here. 
+
+The equivalent default local content repository directory would be specified with:
+`nifi.content.repository.directory.default=file:content_repository`
+
+An HDFS content repository directory would be specified with:
+`nifi.content.repository.directory.default=hdfs://localhost:9000/content_repository`
+
+Example Minimal Configuration:
+`nifi.content.repository.implementation=org.apache.nifi.hdfs.repository.HdfsContentRepository`
+`nifi.content.repository.hdfs.core.site=./conf/core-site.xml`
+
+|====
+|*Property*|*Description*
+|`nifi.content.repository.implementation`|Value should be: `org.apache.nifi.hdfs.repository.HdfsContentRepository`
+|`nifi.content.repository.hdfs.core.site`|The default Hadoop `core-site.xml` file to configure file systems with. +
+
+	*NOTE:* This isn't actually required as long as each location specifies its own core.site.xml, however each directory is required to have a `core-site.xml` defined either with this property, or as described below.
 
 Review comment:
   core-site.xml
