Posted to dev@hop.apache.org by Matt Casters <ma...@neo4j.com.INVALID> on 2021/02/17 10:29:33 UTC

Apache VFS integration

Dear Hoppiverse,

Apache VFS is a very useful tool for supporting alternative ways of storing
files.
We have a plugin type for VFS plugins, which also allows us to extend
functionality.
Where we're not doing a great job is in supporting driver-specific
features like authentication, where it's needed.
For example, AWS needs an access key and a secret key (among a list of
other authentication options).

Usually the configuration of these access APIs underpinning the VFS
drivers is left as standard as possible.
However, this drags the configuration outside of the Hop-isphere, with
effects like the following:
- Variables can't be used.
- Obfuscation/encryption of keys and passwords is not done.
- There are no GUI or configuration elements to configure security.
- No metadata wrappers are available or supported.
...

To get past these shortcomings I propose to turn HopVfs from a singleton
into a proper class.
Class HopVfs is used about 800 times in the source code.  It's a wrapper
around the Apache VFS API.

In the ideal case we would have a new metadata object type called
something like "Amazon Web Services Authentication", which would have
fields like access and secret key, but also perhaps a checkbox:
[x] configured by the system.
We could then have a global configuration option for the S3 VFS driver
which simply says which "AWS Auth" object to use.
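
A rough sketch of what such a metadata type could look like, following
the conventions of the existing Hop metadata API; the class and field
names below are hypothetical:

@HopMetadata(
    key = "aws-authentication",
    name = "Amazon Web Services Authentication",
    description = "AWS credentials for use by e.g. the S3 VFS driver")
public class AwsAuthentication extends HopMetadataBase implements IHopMetadata {

  @HopMetadataProperty
  private String accessKey;

  // password = true lets Hop obfuscate/encrypt the value on serialization
  @HopMetadataProperty(password = true)
  private String secretKey;

  // [x] configured by the system: defer to the standard AWS credential chain
  @HopMetadataProperty
  private boolean configuredBySystem;

  // Getters and setters omitted for brevity.
}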

This would address many if not all concerns.  The "only" thing we need to
change is moving from:

HopVfs.getFileObject(filename)

to

new HopVfs(metadataProvider, variables).getFileObject(filename)

Obviously a case can be made to cache the HopVfs objects at various
locations (one per pipeline, workflow, ...).
You get the idea.
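
To make the shape of the change concrete, here is a minimal sketch.
Only HopVfs and getFileObject() exist today; the constructor, fields and
everything inside the method are assumptions:

import org.apache.commons.vfs2.FileObject;
import org.apache.commons.vfs2.FileSystemOptions;
import org.apache.commons.vfs2.VFS;

public class HopVfs {
  private final IHopMetadataProvider metadataProvider;
  private final IVariables variables;

  public HopVfs(IHopMetadataProvider metadataProvider, IVariables variables) {
    this.metadataProvider = metadataProvider;
    this.variables = variables;
  }

  public FileObject getFileObject(String filename) throws HopFileException {
    try {
      // Variables now get resolved inside the Hop-isphere...
      String realFilename = variables.resolve(filename);
      FileSystemOptions opts = new FileSystemOptions();
      // ...and this is where driver configuration (e.g. the "AWS Auth"
      // metadata object) would be looked up through metadataProvider and
      // applied to the FileSystemOptions before handing off to Apache VFS.
      return VFS.getManager().resolveFile(realFilename, opts);
    } catch (Exception e) {
      throw new HopFileException("Error getting file object for " + filename, e);
    }
  }
}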

I have a feeling that if we don't do this we'll continue to carry this
architectural debt around for quite a while longer, into other
territories like Azure and Google.

Thoughts?

Matt
-- 
Neo4j Chief Solutions Architect
✉  matt.casters@neo4j.com
☎  +32486972937

Re: Apache VFS integration

Posted by Matt Casters <ma...@neo4j.com.INVALID>.
Well, that's why we need these discussions: it's not at all easy to
come up with an architecture that covers all use cases.

There are also wider implications for the ideas pitched so far.

Perhaps the old idea of having File Location as a metadata type wasn't so
bad after all.  We could have a different resolver type for variable
expressions.

The File Location could have its own plugin type, like a relational
database connection has.

The ability to have different forms of expressions beyond constant
values, fields and variables is something that also keeps coming back.
I'm afraid that perhaps we'd be circling around the core issue, which is
that the current architecture is not very flexible in this respect.

So, all in all, I would opt to slowly push towards the original roadmap
item, included in the Apache Hop incubator proposal, to start with
pluggable expressions.  To recap, here is the idea:

Situation now in the GUI:
- You can have visual indicators that you can specify a field name
coming from a previous transform or even a previous workflow result row.
This is usually in the form of a combo box.
- You can have a visual indicator that you are allowed to specify a
variable expression. This is in the form of a small $ symbol to the
right of the input widget.
- In all other scenarios we assume you can just enter a fixed value.
- Metadata elements are considered a special case when, in fact,
metadata-wise all that is kept around is a simple string coming from a
simple combo box...

In the code this is handled in a variety of ways.  There are
assumptions being made, and metadata is sometimes loaded lazily,
sometimes immediately.  It's hard to find a lot of consistency.

Pluggable expressions: any input field can be a HopExpression, which is
aware of the type of expression it wants to represent.  It's also aware
of the list of possible options based on the available input and the
available plugins.
When it comes time to resolve the actual value of an expression, we can
have a resolve() method with all possible arguments from the current
execution context, including IVariables, the current row (IRowMeta,
Object[]), HopMetadataProvider and what not.
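
Purely as an illustration, the contract could look something like this
(none of these names exist yet):

public interface IHopExpression {
  /** Which expression plugin knows how to evaluate this expression. */
  String getExpressionTypeId();

  /** The raw expression text as stored in the metadata. */
  String getExpression();

  /** Resolve the actual value using the current execution context. */
  Object resolve(
      IVariables variables,
      IRowMeta rowMeta,
      Object[] row,
      IHopMetadataProvider metadataProvider)
      throws HopException;
}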

One of these plugin types could then resolve a file, or execute a bit
of Java, JavaScript or Python. It could also calculate the result of an
Informatica or DataStage script/expression.

For the GUI we'd replace the $ symbol and Ctrl-Space with something
more intelligent, and I'm sure we would be improving life for a lot of
folks.

Anyway, these ideas need to go around a bit more and we'd have to take
some time to prototype.  Perhaps it's best to wait until we've done a
few more releases?

Cheers,
Matt

Re: Apache VFS integration

Posted by Hans Van Akelyen <ha...@gmail.com>.
Hi All,

One of our main goals is to make configuration as simple as possible.
I am all in on creating a metadata object for VFS, and I agree with
Nicolas that a VFS object could work like our database connections: pick
a type, and the specific option fields to define the connection become
visible.

I also think we could go a bit further than only saving the global
settings/authentication.

For S3, the general idea I have would be to create a VFS object named
"test-bucket"; it would contain the root path to S3 and how to
authenticate, be it via system variables, a key file, ... In our input
transforms we could then get files using this named object and an added
path: "vfs:test-bucket:/path/to/file.csv" (format subject to change; see
the sketch below).

This also adds the ability to use multiple buckets/accounts with
different keys/locations, making Hop even more flexible.

Let me know if this idea is totally bananas or a possible path to follow.

Cheers,
Hans

Re: Apache VFS integration

Posted by Nicolas ADMENT <na...@gmail.com>.
Dear Hop thinkers,

I think we should create a new type of metadata to share OAuth, VFS,
FTP, SSH, IMAP, POP3, AS400, SAP... connection settings that we could
reuse in different plugins.
This new type of metadata object could be called something like
"IConnection", with a name, a protocol type, and custom properties
depending on the protocol type (host/login/password, secret ID/key,
...), much like databases.

For VFS, the connection name must be the same as the protocol scheme of
the URL used.
To configure an Amazon connection, you would define the predefined name
S3 and select the "Amazon Web Services Authentication" protocol, then
set the authentication properties.
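
To make the idea a bit more tangible, a hypothetical shape; nothing
here exists in Hop today:

import java.util.Map;

public interface IConnection {
  // The connection name; for VFS it must match the URL scheme, e.g. "s3".
  String getName();

  // e.g. "Amazon Web Services Authentication", "FTP", "SSH", ...
  String getProtocolType();

  // Protocol-specific properties: host/login/password, secret ID/key,
  // proxy settings, ...
  Map<String, String> getProperties();
}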

Some web-based protocols may have common properties like proxy settings.

We could be inspired by:
https://www.visualcron.com/doc/html/connection-visualcron.html

Maybe it will give you some ideas.

Regards,
Nicolas