Posted to dev@nifi.apache.org by Mike Thomsen <mi...@gmail.com> on 2022/10/25 17:33:05 UTC

Need to deprecate the deduplication functionality in the MongoDB GridFS processors

The hash-based deduplication strategy used the built-in "md5"
attribute to offload the work to the database. That functionality was
deprecated and AFAICT gone as of Mongo 5:

https://www.mongodb.com/docs/manual/core/gridfs/#files.md5

I am proposing two changes:

* Remove deduplication
* Create a MongoDB DistributedMapCache client that can query on the
file metadata. GridFS stores metadata separately from chunks, which
makes lookups that way cheap and flexible.
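
A minimal sketch of the lookup idea in the second bullet, assuming the
flow computes a content hash and keeps it with the file metadata. The
MetadataDedupIndex class, its method names, and the in-memory map are
hypothetical stand-ins for a query against GridFS file metadata; they
are not NiFi or MongoDB driver APIs.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for a metadata-backed lookup. A real client would
// query the GridFS files collection instead of an in-memory map.
final class MetadataDedupIndex {
    // Maps a content hash (hex string) to the name of the file first stored with it.
    private final Map<String, String> filesByHash = new ConcurrentHashMap<>();

    // Records the hash and returns true if it was not seen before;
    // returns false when a file with the same hash is already recorded.
    boolean putIfAbsent(String contentHash, String filename) {
        return filesByHash.putIfAbsent(contentHash, filename) == null;
    }

    // Returns the filename first recorded for this hash, or null if there is none.
    String lookup(String contentHash) {
        return filesByHash.get(contentHash);
    }

    public static void main(String[] args) {
        MetadataDedupIndex index = new MetadataDedupIndex();
        System.out.println(index.putIfAbsent("abc123", "first.bin"));  // new hash
        System.out.println(index.putIfAbsent("abc123", "second.bin")); // duplicate hash
        System.out.println(index.lookup("abc123"));
    }
}
```

The put-if-absent shape keeps the duplicate check and the record of the
first file in one step, so a flow could route on the boolean result.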

I could easily add that to this PR which already covers Testcontainers
integration, making it super easy to test the changed behavior:

https://github.com/apache/nifi/pull/6460

Thoughts?

Re: Need to deprecate the deduplication functionality in the MongoDB GridFS processors

Posted by Mike Thomsen <mi...@gmail.com>.
I might have to look into it more, then, because I wasn't able to get it
to work on MongoDB 5.x. It might just be that the Docker container is
set up with better defaults, since the lack of "sane defaults" was
historically something Mongo got a lot of flak for.

On Thu, Oct 27, 2022 at 9:36 PM David Handermann
<ex...@apache.org> wrote:
>
> Mike,
>
> Thanks for raising this issue for additional discussion. According to the
> MongoDB document referenced, the md5 option is deprecated, but not yet
> removed:
>
> > The MD5 algorithm is prohibited by FIPS 140-2. MongoDB drivers deprecate
> > MD5 support and will remove MD5 generation in future releases. Applications
> > that require a file digest should implement it outside of GridFS and store
> > in files.metadata
> > <https://www.mongodb.com/docs/manual/core/gridfs/#mongodb-data-files.metadata>
>
> There is a configuration option called disableMD5, but it still appears to
> be part of the GridFS specification. Were you able to confirm that it
> breaks in MongoDB 5 or 6?
>
> I agree that we should be able to address this behavior in the current
> version of NiFi, and it seems like having a transitional way forward would
> be helpful. If the Testcontainers change can verify the current MD5
> functionality, that should provide a good baseline for a subsequent PR to
> implement a new hashing strategy.
>
> Regards,
> David Handermann

Re: Need to deprecate the deduplication functionality in the MongoDB GridFS processors

Posted by David Handermann <ex...@apache.org>.
Mike,

Thanks for raising this issue for additional discussion. According to the
MongoDB document referenced, the md5 option is deprecated, but not yet
removed:

> The MD5 algorithm is prohibited by FIPS 140-2. MongoDB drivers deprecate
> MD5 support and will remove MD5 generation in future releases. Applications
> that require a file digest should implement it outside of GridFS and store
> in files.metadata
> <https://www.mongodb.com/docs/manual/core/gridfs/#mongodb-data-files.metadata>
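
To illustrate what "implement it outside of GridFS" could look like,
here is a small sketch that computes a digest with the Java standard
library and prepares a metadata entry for it. The choice of SHA-256, the
ContentDigest class, and the "sha256" metadata key are assumptions made
for this example; nothing in the thread or the MongoDB documentation
prescribes them.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;

// Hypothetical helper: computes a content digest in the application
// rather than relying on the GridFS "md5" field.
final class ContentDigest {
    // Returns the SHA-256 digest of the content as lowercase hex.
    static String sha256Hex(byte[] content) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(content);
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Builds the entry that would be stored under files.metadata.
    // The key name "sha256" is an assumption for this sketch.
    static Map<String, String> metadataFor(byte[] content) {
        return Map.of("sha256", sha256Hex(content));
    }

    public static void main(String[] args) {
        byte[] content = "abc".getBytes(StandardCharsets.UTF_8);
        System.out.println(metadataFor(content));
    }
}
```

A deduplication check could then compare this value against the same
metadata key on existing files instead of the deprecated md5 field.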

There is a configuration option called disableMD5, but it still appears to
be part of the GridFS specification. Were you able to confirm that it
breaks in MongoDB 5 or 6?

I agree that we should be able to address this behavior in the current
version of NiFi, and it seems like having a transitional way forward would
be helpful. If the Testcontainers change can verify the current MD5
functionality, that should provide a good baseline for a subsequent PR to
implement a new hashing strategy.

Regards,
David Handermann

On Tue, Oct 25, 2022 at 1:36 PM Mike Thomsen <mi...@gmail.com> wrote:

> As-is, the deduplication-by-hash functionality appears to now be
> broken w/ Mongo 5 and higher. We can address that by doing some
> updates to the code base and recommending users add a HashContent
> processor before PutGridFS, but flows are going to break either way
> thanks to changes in Mongo itself. That's why I'm not sure we should
> be dogmatic about waiting.

Re: Need to deprecate the deduplication functionality in the MongoDB GridFS processors

Posted by Mike Thomsen <mi...@gmail.com>.
As-is, the deduplication-by-hash functionality appears to now be
broken w/ Mongo 5 and higher. We can address that by doing some
updates to the code base and recommending users add a HashContent
processor before PutGridFS, but flows are going to break either way
thanks to changes in Mongo itself. That's why I'm not sure we should
be dogmatic about waiting.

On Tue, Oct 25, 2022 at 2:15 PM Pierre Villard
<pi...@gmail.com> wrote:
>
> IMO we should start working on NiFi 2.0 going forward and it sounds like a
> good opportunity to make such changes in our components.

Re: Need to deprecate the deduplication functionality in the MongoDB GridFS processors

Posted by Pierre Villard <pi...@gmail.com>.
IMO we should start working on NiFi 2.0 going forward and it sounds like a
good opportunity to make such changes in our components.


On Tue, Oct 25, 2022 at 19:33, Mike Thomsen <mi...@gmail.com> wrote:

> The hash-based deduplication strategy used the built-in "md5"
> attribute to offload the work to the database. That functionality was
> deprecated and AFAICT gone as of Mongo 5:
>
> https://www.mongodb.com/docs/manual/core/gridfs/#files.md5
>
> I am proposing two changes:
>
> * Remove deduplication
> * Create a MongoDB DistributedMapCache client that can query on the
> file metadata since GridFS stores metadata separately from chunks
> making lookups that way cheap and flexible.
>
> I could easily add that to this PR which already covers Testcontainers
> integration, making it super easy to test the changed behavior:
>
> https://github.com/apache/nifi/pull/6460
>
> Thoughts?
>