Posted to issues@nifi.apache.org by "Tamas Palfy (Jira)" <ji...@apache.org> on 2023/12/13 20:08:00 UTC

[jira] [Commented] (NIFI-9464) Provenance Events files corrupted

    [ https://issues.apache.org/jira/browse/NIFI-9464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17796442#comment-17796442 ] 

Tamas Palfy commented on NIFI-9464:
-----------------------------------

[~markap14] I think I found the root cause of this issue, but I couldn't confirm it with certainty because the provenance repository bundle is complex and the issue is very hard to reproduce. Could you help by reviewing my pull request?

The provenance journal files are regularly compressed on dedicated threads. During compression, the content of the uncompressed .prov file is compressed and written into a .prov.gz file, after which the .prov file is deleted. The corresponding .toc file is also updated, but in a different way: the new TOC is first written as .toc.tmp, and only after the .prov file has been deleted is .toc.tmp renamed to .toc.

The SiteToSiteProvenanceReportingTask runs on a Timer-Driven thread. When it references a compressed .prov.gz file, it must also reference the new .toc file.

However, a race condition can occur: the Timer-Driven thread may run after the .prov file has been deleted but before .toc.tmp has been renamed to .toc. The original .toc file still exists at that point, so it gets paired with the .prov.gz file. That can lead to the reported error.
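
To make the window easier to see, here is a minimal, single-threaded sketch of the ordering described above. It is not NiFi code: the class name, the helper methods, the file names, the flat directory layout and the TOC contents are all assumptions made for illustration. After each step it records which event file and which TOC a reader would pair if it looked at the directory at that moment.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPOutputStream;

// Illustrative only: a simplified model of the compression steps, not the NiFi implementation.
class CompressionWindowSketch {

    // Returns "<event file> + <TOC contents>" as a reader would see them right now.
    static String readerView(Path dir, String base) throws IOException {
        Path prov = dir.resolve(base + ".prov");
        String eventFile = Files.exists(prov) ? base + ".prov" : base + ".prov.gz";
        String toc = Files.readString(dir.resolve(base + ".toc")).trim();
        return eventFile + " + " + toc;
    }

    // Runs the three steps in order and returns the reader's view after each one.
    static List<String> run() {
        try {
            Path dir = Files.createTempDirectory("prov-sketch");
            String base = "example"; // hypothetical journal name
            Path prov = dir.resolve(base + ".prov");
            Path gz = dir.resolve(base + ".prov.gz");
            Path toc = dir.resolve(base + ".toc");
            Path tocTmp = dir.resolve(base + ".toc.tmp");
            Files.writeString(prov, "event data");
            Files.writeString(toc, "offsets for uncompressed file");
            List<String> views = new ArrayList<>();

            // Step 1: write the compressed copy and the new TOC under its temporary name.
            try (OutputStream out = new GZIPOutputStream(Files.newOutputStream(gz))) {
                Files.copy(prov, out);
            }
            Files.writeString(tocTmp, "offsets for compressed file");
            views.add(readerView(dir, base)); // example.prov + offsets for uncompressed file

            // Step 2: delete the uncompressed file. The old .toc is still in place.
            Files.delete(prov);
            views.add(readerView(dir, base)); // example.prov.gz + offsets for uncompressed file (mismatch)

            // Step 3: rename .toc.tmp to .toc. Only now do the two files match again.
            Files.move(tocTmp, toc, StandardCopyOption.REPLACE_EXISTING);
            views.add(readerView(dir, base)); // example.prov.gz + offsets for compressed file

            // Clean up the temporary files.
            Files.delete(gz);
            Files.delete(toc);
            Files.delete(dir);
            return views;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        run().forEach(System.out::println);
    }
}
```

The second recorded view is the problematic one: a reader that runs between step 2 and step 3 sees the compressed file together with offsets that were computed for the uncompressed file. Proper locking around these steps is what should keep a reader out of that window.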

This wasn't always the case. Both places use an EventFileManager to acquire locks, and they used to share the same instance.
However, the [PR|https://github.com/apache/nifi/pull/1686] for NIFI-3594 introduced a change in WriteAheadProvenanceRepository after which the two places started using different EventFileManager objects.

This is the important part:
{code:java}
    @Override
    public synchronized void initialize(final EventReporter eventReporter, final Authorizer authorizer, final ProvenanceAuthorizableFactory resourceFactory, final IdentifierLookup idLookup) throws IOException {
        ....
        final EventFileManager fileManager = new EventFileManager();
        final RecordReaderFactory recordReaderFactory = (file, logs, maxChars) -> {
            fileManager.obtainReadLock(file);
            try {
                return RecordReaders.newRecordReader(file, logs, maxChars);
            } finally {
                fileManager.releaseReadLock(file);
            }
        };

        init(recordWriterFactory, recordReaderFactory, eventReporter, authorizer, resourceFactory);
    }

    synchronized void init(RecordWriterFactory recordWriterFactory, RecordReaderFactory recordReaderFactory,
                           final EventReporter eventReporter, final Authorizer authorizer,
                           final ProvenanceAuthorizableFactory resourceFactory) throws IOException {
        final EventFileManager fileManager = new EventFileManager();
        ...
    }
{code}

My assessment is that this change (making the two EventFileManager instances different) was unintentional, but the aforementioned PR is too complex for me to be 100% sure.
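
To illustrate why the separate instances matter, here is a minimal sketch under stated assumptions. FileLockManager is a hypothetical stand-in that keeps one read/write lock per file name; it is not NiFi's EventFileManager, and the method names are made up for this example. The point is only that per-file locks kept inside one manager object cannot protect against code that asks a different manager object for "the same" file.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Illustrative only: a hypothetical per-file lock manager, not NiFi's EventFileManager.
class SharedLockManagerSketch {

    static class FileLockManager {
        private final ConcurrentMap<String, ReadWriteLock> locks = new ConcurrentHashMap<>();

        // Returns the lock for this file name, creating it on first use.
        ReadWriteLock lockFor(String file) {
            return locks.computeIfAbsent(file, f -> new ReentrantReadWriteLock());
        }
    }

    // Returns true if a reader on another thread can take the read lock
    // while the calling thread holds the write lock for the same file name.
    static boolean readerGetsInDuringWrite(FileLockManager writerSide, FileLockManager readerSide) {
        String file = "example.prov"; // hypothetical file name
        ReadWriteLock writerLock = writerSide.lockFor(file);
        AtomicBoolean acquired = new AtomicBoolean(false);
        writerLock.writeLock().lock();
        try {
            Thread reader = new Thread(() -> {
                ReadWriteLock readerLock = readerSide.lockFor(file);
                if (readerLock.readLock().tryLock()) {
                    acquired.set(true);
                    readerLock.readLock().unlock();
                }
            });
            reader.start();
            reader.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            writerLock.writeLock().unlock();
        }
        return acquired.get();
    }

    public static void main(String[] args) {
        FileLockManager shared = new FileLockManager();
        // One manager on both sides: the reader is kept out while the writer works.
        System.out.println("shared manager, reader gets in: "
                + readerGetsInDuringWrite(shared, shared)); // false
        // Two managers: each hands out its own lock, so the reader is not blocked.
        System.out.println("separate managers, reader gets in: "
                + readerGetsInDuringWrite(new FileLockManager(), new FileLockManager())); // true
    }
}
```

In this sketch, passing the same manager to both sides restores mutual exclusion. Whether that maps directly onto WriteAheadProvenanceRepository is exactly the part I could not confirm.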

I'm going to open a pull request according to my assessment.

> Provenance Events files corrupted
> ---------------------------------
>
>                 Key: NIFI-9464
>                 URL: https://issues.apache.org/jira/browse/NIFI-9464
>             Project: Apache NiFi
>          Issue Type: Bug
>          Components: Core Framework
>    Affects Versions: 1.11.0, 1.15.0
>         Environment: java 11, centos 7, nifi standalone
>            Reporter: Wiktor Kubicki
>            Assignee: Tamas Palfy
>            Priority: Minor
>
> In my logs I found:
> {code:java}
> SiteToSiteProvenanceReportingTask[id=b209c0ae-016e-1000-ae39-301c9dcfc544] Failed to retrieve Provenance Events from repository due to: Attempted to skip to byte offset 9149491 for 1125432890.prov.gz but file does not have that many bytes (TOC Reader=StandardTocReader[file=/..../provenance_repository/toc/1125432890.toc, compressed=false]): java.io.EOFException: Attempted to skip to byte offset 9149491 for 1125432890.prov.gz but file does not have that many bytes (TOC Reader=StandardTocReader[file=/.../provenance_repository/toc/1125432890.toc, compressed=false])
> {code}
> It is critically important for me to be 100% sure of my logs. It happened about 100 times in the last year for 15 *.prov.gz files:
> {code:java}
> -rw-rw-rw-. 1 user user 1013923 Oct 17 21:17 1075441276.prov.gz
> -rw-rw-rw-. 1 user user 1345431 Oct 24 13:06 1083362251.prov.gz
> -rw-rw-rw-. 1 user user 1359282 Oct 25 13:07 1084546392.prov.gz
> -rw-rw-rw-. 1 user user 1155791 Nov  2 17:08 1094516954.prov.gz
> -rw-rw-r--. 1 user user  974136 Nov 18 22:07 1113402183.prov.gz
> -rw-rw-r--. 1 user user 1125608 Nov 28 22:00 1125097576.prov.gz
> -rw-rw-r--. 1 user user 1248319 Nov 29 04:30 1125432890.prov.gz
> -rw-rw-r--. 1 user user  832120 Feb  2  2021 661957813.prov.gz
> -rw-rw-r--. 1 user user 1110978 Mar 17  2021 734807613.prov.gz
> -rw-rw-r--. 1 user user 1506819 Apr 16  2021 786154249.prov.gz
> -rw-rw-r--. 1 user user 1763198 May 25  2021 852626782.prov.gz
> -rw-rw-r--. 1 user user 1580598 Jun 15 08:32 891934274.prov.gz
> -rw-rw-r--. 1 user user 2960296 Jun 28 17:07 917991812.prov.gz
> -rw-rw-r--. 1 user user 1808037 Jun 28 17:37 918051650.prov.gz
> -rw-rw-rw-. 1 user user  765924 Aug 14 13:09 991505484.prov.gz
> {code}
> BTW, it's interesting that the files have different permissions (chmods).
> My config for provenance (BTW, if you see a possibility to tune it, please tell me):
> {code:java}
> nifi.provenance.repository.directory.default=/....../provenance_repository
> nifi.provenance.repository.max.storage.time=730 days
> nifi.provenance.repository.max.storage.size=512 GB
> nifi.provenance.repository.rollover.time=10 mins
> nifi.provenance.repository.rollover.size=100 MB
> nifi.provenance.repository.query.threads=2
> nifi.provenance.repository.index.threads=1
> nifi.provenance.repository.compress.on.rollover=true
> nifi.provenance.repository.always.sync=false
> nifi.provenance.repository.indexed.fields=EventType, FlowFileUUID, Filename, ProcessorID
> nifi.provenance.repository.indexed.attributes=
> nifi.provenance.repository.index.shard.size=1 GB
> nifi.provenance.repository.max.attribute.length=65536
> nifi.provenance.repository.concurrent.merge.threads=1
> nifi.provenance.repository.buffer.size=100000
> {code}
> Now my provenance repo has 140GB of data.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)