Posted to issues@nifi.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2021/12/02 19:25:00 UTC

[jira] [Commented] (NIFI-9436) PutParquet, PutORC processors may hang after writing to ADLS

    [ https://issues.apache.org/jira/browse/NIFI-9436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17452570#comment-17452570 ] 

ASF subversion and git services commented on NIFI-9436:
-------------------------------------------------------

Commit ff864266f59e70a67b2a1f2c787a0f74464b6a9d in nifi's branch refs/heads/main from Tamas Palfy
[ https://gitbox.apache.org/repos/asf?p=nifi.git;h=ff86426 ]

NIFI-9436 - In AbstractPutHDFSRecord make sure the record writers use the FileSystem object the processor already has.

Signed-off-by: Matthew Burgess <ma...@apache.org>

This closes #5565


> PutParquet, PutORC processors may hang after writing to ADLS
> ------------------------------------------------------------
>
>                 Key: NIFI-9436
>                 URL: https://issues.apache.org/jira/browse/NIFI-9436
>             Project: Apache NiFi
>          Issue Type: Bug
>            Reporter: Tamas Palfy
>            Assignee: Tamas Palfy
>            Priority: Major
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> h2. Background
> In *AbstractPutHDFSRecord* (from which *PutParquet* and *PutORC* are derived) an *org.apache.parquet.hadoop.ParquetWriter* is {_}created{_}, used to write records to an HDFS location, and later _closed_ explicitly.
> The writer creation process involves the instantiation of an *org.apache.hadoop.fs.FileSystem* object which, when writing to ADLS, is an {*}org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem{*}.
> Note that the NiFi AbstractPutHDFSRecord processor has already created a FileSystem object of the same type for its own purposes, but the writer creates its own.
> It could still be the same if caching was enabled but it is explicitly disabled in *AbstractHadoopProcessor* (the parent of _AbstractPutHDFSRecord_).
> The writer only uses the FileSystem object to create an *org.apache.hadoop.fs.azurebfs.services.AbfsOutputStream* object and doesn't keep a reference to the FileSystem object itself.
> This makes the FileSystem object eligible for garbage collection.
> The AbfsOutputStream writes data asynchronously: it submits write tasks to an ExecutorService and stores them in a collection.
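The reference pattern described above can be simulated in plain Java. These Fake* classes are illustrative stand-ins, not the actual Hadoop classes: the "file system" hands its executor to the stream it creates, and nothing else keeps a reference to the file system itself.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Simulation of the reference graph: the stream keeps the executor,
// not the file system that created it, so the file system can become
// eligible for garbage collection while writes are still in flight.
class FakeFileSystem {
    private final ExecutorService executor = Executors.newSingleThreadExecutor();

    FakeOutputStream create() {
        // The stream receives only the executor; no back-reference
        // to this FakeFileSystem is retained.
        return new FakeOutputStream(executor);
    }

    void close() {
        executor.shutdown();
    }

    @Override
    protected void finalize() {
        close(); // mirrors AzureBlobFileSystem closing itself on finalize()
    }
}

class FakeOutputStream {
    private final ExecutorService executor;

    FakeOutputStream(ExecutorService executor) {
        this.executor = executor;
    }

    Future<?> writeAsync(Runnable task) {
        // Works only while the shared executor is still running.
        return executor.submit(task);
    }
}
```

While the FakeFileSystem is strongly reachable, writeAsync succeeds; the failure mode appears once it is finalized, as described in the next section.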
> h2. The issue
>  * The _AzureBlobFileSystem_ and the _AbfsOutputStream_ both have reference to the same _ThreadPoolExecutor_ object.
>  * _AzureBlobFileSystem_ (probably depending on the version) overrides the _finalize()_ method and closes itself when that is called. This involves shutting down the referenced {_}ThreadPoolExecutor{_}.
>  * It's possible for garbage collection to occur after the _ParquetWriter_ is created but before it is explicitly closed. GC -> _AzureBlobFileSystem.finalize()_ -> {_}ThreadPoolExecutor.shutdown(){_}.
>  * When the _ParquetWriter_ is explicitly closed, it tries to run a cleanup job using the {_}ThreadPoolExecutor{_}. The job submission fails because the _ThreadPoolExecutor_ has already terminated, but a _Future_ object is still created - and is waited for indefinitely.
> This causes the processor to hang.
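The failure at the last step can be seen with a bare ThreadPoolExecutor. This is a minimal sketch, not the actual hadoop-azure code path: a plain submit() surfaces the problem as a RejectedExecutionException, whereas in the scenario described above a Future is apparently still produced and waited on.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;

public class ShutdownDemo {

    // Returns true if the task was accepted, false if it was rejected.
    static boolean trySubmitAfterShutdown() {
        ExecutorService executor = Executors.newFixedThreadPool(1);
        executor.shutdown(); // what AzureBlobFileSystem.finalize() effectively does
        try {
            // A terminated executor rejects new tasks (default AbortPolicy).
            executor.submit(() -> {});
            return true;
        } catch (RejectedExecutionException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(trySubmitAfterShutdown() ? "submitted" : "rejected"); // prints "rejected"
    }
}
```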
> h2. The solution
> This feels like an issue that should be addressed in the _hadoop-azure_ library, but it's possible to apply a workaround in NiFi.
> The problem starts with the _AzureBlobFileSystem_ being garbage collected. So if the _ParquetWriter_ used the same _FileSystem_ object that the processor already created for itself (and keeps a reference to), that would prevent the garbage collection from occurring.
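The principle behind the workaround is simply that a strongly reachable object can never be collected, so its finalizer cannot run. A minimal sketch, not NiFi code; the field name below is a hypothetical stand-in for the processor's long-lived FileSystem reference:

```java
import java.lang.ref.WeakReference;

public class StrongRefDemo {
    // Stands in for the FileSystem field the processor keeps for itself.
    static final Object processorFileSystem = new Object();

    public static void main(String[] args) {
        WeakReference<Object> ref = new WeakReference<>(processorFileSystem);
        System.gc(); // a strongly reachable object survives any GC cycle
        System.out.println(ref.get() != null); // prints "true"
    }
}
```

If the writers use this shared, strongly held object instead of creating their own, its finalize() (and the executor shutdown it triggers) cannot happen while a write is still in progress.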



--
This message was sent by Atlassian Jira
(v8.20.1#820001)