Posted to issues@nifi.apache.org by "Tamas Palfy (Jira)" <ji...@apache.org> on 2021/12/02 14:10:00 UTC

[jira] [Created] (NIFI-9436) PutParquet, PutORC processors may hang after writing to ADLS

Tamas Palfy created NIFI-9436:
---------------------------------

             Summary: PutParquet, PutORC processors may hang after writing to ADLS
                 Key: NIFI-9436
                 URL: https://issues.apache.org/jira/browse/NIFI-9436
             Project: Apache NiFi
          Issue Type: Bug
            Reporter: Tamas Palfy


h2. Background
h2. Background
In *AbstractPutHDFSRecord* (from which *PutParquet* and *PutORC* are derived) an *org.apache.parquet.hadoop.ParquetWriter* is _created_, used to write records to an HDFS location, and later _closed_ explicitly.

The writer creation process involves instantiating an org.apache.hadoop.fs.FileSystem object which, when writing to ADLS, is an org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.
Note that the NiFi AbstractPutHDFSRecord processor has already created a FileSystem object of the same type for its own purposes, but the writer creates its own.

The writer only uses the FileSystem object to create an AbfsOutputStream and does not keep a reference to the FileSystem object itself.
This makes the FileSystem object eligible for garbage collection.

The AbfsOutputStream writes data asynchronously: it submits write tasks to an ExecutorService and stores the resulting Futures in a collection.
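The asynchronous write pattern can be sketched in plain Java (the class and method names here are illustrative stand-ins, not the actual Hadoop API): tasks are submitted to a pool, the Futures are collected, and a flush/close waits on every pending one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class AsyncWriteDemo {
    // Submit simulated "block uploads" and wait for all of them,
    // conceptually what AbfsOutputStream does with its ExecutorService.
    static int flushAll() throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        List<Future<Integer>> pending = new ArrayList<>();
        for (int block = 1; block <= 3; block++) {
            final int bytes = block * 10;
            pending.add(pool.submit(() -> bytes)); // pretend upload of one block
        }
        int total = 0;
        for (Future<Integer> f : pending) {
            total += f.get(); // close()/flush() must wait on every pending task
        }
        pool.shutdown();
        return total; // 10 + 20 + 30
    }

    public static void main(String[] args) throws Exception {
        System.out.println("bytes flushed: " + flushAll());
    }
}
```

This is exactly why the executor's lifecycle matters: if the pool is shut down underneath the stream, the waiting step can no longer make progress.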

h2. The issue

* The AzureBlobFileSystem and the AbfsOutputStream both hold a reference to the same ThreadPoolExecutor object.
* AzureBlobFileSystem -depending on the version- overrides the finalize() method and closes itself when that is called. This involves shutting down the referenced ThreadPoolExecutor.
* It is possible for garbage collection to occur after the ParquetWriter is created but before it is explicitly closed: GC -> AzureBlobFileSystem.finalize() -> ThreadPoolExecutor.shutdown().
* When the ParquetWriter is explicitly closed, it tries to run a cleanup job using the ThreadPoolExecutor. That submission fails because the ThreadPoolExecutor is already terminated, but a Future object is still created - and is then waited on indefinitely.

This causes the processor to hang.
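The failure mode above can be reproduced in isolation. The sketch below is an assumption-laden illustration, not the actual hadoop-azure code path: a DiscardPolicy stands in for any submission path that swallows the rejection instead of surfacing it, and a short get() timeout stands in for the real code's unbounded wait.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.FutureTask;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class HangDemo {
    // Returns true when the cleanup Future can never complete.
    static boolean cleanupFutureNeverCompletes() throws InterruptedException {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS,
                new LinkedBlockingQueue<Runnable>(),
                new ThreadPoolExecutor.DiscardPolicy()); // swallow rejections
        pool.shutdown(); // what finalize() effectively does to the shared pool

        FutureTask<Void> cleanup = new FutureTask<>(() -> null);
        pool.execute(cleanup); // silently discarded: the task will never run

        try {
            // The real code waits with no timeout and therefore blocks forever.
            cleanup.get(200, TimeUnit.MILLISECONDS);
            return false;
        } catch (ExecutionException e) {
            return false;
        } catch (TimeoutException e) {
            return true; // Future exists but can never be completed
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("hangs: " + cleanupFutureNeverCompletes());
    }
}
```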

h2. The solution

This feels like an issue that should be addressed in the hadoop-azure library, but a workaround can be applied in NiFi.

The problem starts with the AzureBlobFileSystem getting garbage collected. So if the ParquetWriter used the same FileSystem object that the processor already created for itself -and keeps a reference to- that long-lived reference would prevent the garbage collection from occurring.
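The principle behind the workaround can be shown with a minimal sketch (a plain Java stand-in, not the actual Hadoop classes): as long as something long-lived holds a strong reference to the FileSystem, it cannot become weakly reachable, so its finalize() is never invoked and the executor stays up.

```java
import java.lang.ref.Reference;
import java.lang.ref.WeakReference;

public class KeepAliveDemo {
    // Stand-in for AzureBlobFileSystem; in the real class finalize()
    // shuts down the shared ThreadPoolExecutor.
    static class FileSystemLike { }

    static boolean keptAlive() {
        FileSystemLike processorOwned = new FileSystemLike(); // long-lived ref
        WeakReference<FileSystemLike> ref = new WeakReference<>(processorOwned);

        System.gc(); // a strongly reachable object survives even an explicit GC
        boolean alive = ref.get() != null;

        // Keep the reference live past the check despite JIT liveness analysis.
        Reference.reachabilityFence(processorOwned);
        return alive;
    }

    public static void main(String[] args) {
        System.out.println("kept alive: " + keptAlive());
    }
}
```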





--
This message was sent by Atlassian Jira
(v8.20.1#820001)