Posted to dev@nifi.apache.org by Andy Christianson <ai...@protonmail.com> on 2017/11/09 16:06:45 UTC

minifi-cpp: Expression language core API

MiNiFi - C++ Devs,

I am currently working on MINIFICPP-49, the expression language feature. While expression compilation and evaluation is fairly self-contained, at the very least the API to access expression evaluation will touch core components.

Here is how NiFi is currently exposing expression evaluation to processors:

    ...
    try {
                // read the url property from the context
                final String urlstr = trimToEmpty(context.getProperty(PROP_URL).evaluateAttributeExpressions(requestFlowFile).getValue());
                final URL url = new URL(urlstr);
    ...

While we now have the opportunity to improve on this, we have a couple of design constraints: the expression code comes from properties, and dynamic evaluation of it requires a flow file as input.

Because expressions are defined as processor properties, it is natural to expose expression evaluation via the ProcessContext API. The current minifi-cpp API to get static properties is as follows:

    bool getProperty(const std::string &name, std::string &value) {
      return processor_node_->getProperty(name, value);
    }

If we do not wish to introduce a Property type with its own evaluateAttributeExpressions method, we could simply introduce another ProcessContext method for evaluating dynamic properties:

    bool evaluateProperty(const std::string &name, const core::FlowFile &flow_file, std::string &value) {
      ...
    }

The implementation of this would compile the expression (the raw value as returned by getProperty(...)) if it has not yet been compiled, then evaluate the compiled expression against the provided FlowFile. The end result is an API similar to, but simpler than, the NiFi interface. The alternative is to provide the expression primitives to processors and allow them to manage compilation/evaluation on their own. This would increase complexity across all processors which support expression properties, which will likely be most processors.
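A rough sketch of the compile-once-then-evaluate idea follows. All names here (the toy Expression type, the placeholder compile function, the simplified FlowFile) are illustrative assumptions, not the actual minifi-cpp implementation; the stand-in "compiler" only substitutes a single ${attr} token, whereas the real lexer/parser handles the full language:

```cpp
#include <functional>
#include <map>
#include <string>

// Simplified stand-in for core::FlowFile: just an attribute map.
struct FlowFile {
  std::map<std::string, std::string> attributes;
};

// A compiled expression is modeled as a callable over a flow file.
using Expression = std::function<std::string(const FlowFile &)>;

// Toy "compiler": substitutes the first ${attr} occurrence with the
// attribute's value. A real implementation would parse the full language.
Expression compile(const std::string &raw) {
  return [raw](const FlowFile &ff) {
    auto begin = raw.find("${");
    if (begin == std::string::npos) return raw;
    auto end = raw.find('}', begin);
    if (end == std::string::npos) return raw;
    auto key = raw.substr(begin + 2, end - begin - 2);
    auto it = ff.attributes.find(key);
    std::string val = (it == ff.attributes.end()) ? "" : it->second;
    return raw.substr(0, begin) + val + raw.substr(end + 1);
  };
}

class ProcessContext {
 public:
  void setProperty(const std::string &name, const std::string &value) {
    properties_[name] = value;
  }

  bool getProperty(const std::string &name, std::string &value) const {
    auto it = properties_.find(name);
    if (it == properties_.end()) return false;
    value = it->second;
    return true;
  }

  // Compile the raw value on first use, cache the compiled expression,
  // then evaluate it against the provided flow file. Caching by name
  // assumes the property value does not change after scheduling.
  bool evaluateProperty(const std::string &name, const FlowFile &flow_file,
                        std::string &value) {
    std::string raw;
    if (!getProperty(name, raw)) return false;
    auto cached = compiled_.find(name);
    if (cached == compiled_.end()) {
      cached = compiled_.emplace(name, compile(raw)).first;
    }
    value = cached->second(flow_file);
    return true;
  }

 private:
  std::map<std::string, std::string> properties_;
  std::map<std::string, Expression> compiled_;
};
```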

The next important question impacting core minifi is whether or not the expression language should be an extension. Either way, some kind of standard interface to expressions will need to be made available to all processors. Here are the pros/cons of putting it in an extension, as far as I can tell:

Pros:

- Reduce compiled size of minifi somewhat (the lexer/parser is currently ~4300 lines of C++ with no additional library or runtime dependencies) when feature is disabled
- Allow for alternate expression language implementations in the future

Cons:

- Additional complexity from needing to add Expression primitives, a standard Expression compiler API, dynamic object loading, and an empty (NoOp) implementation for when the extension is not included
- Additional vtable lookups on an operation which will be invoked very frequently (every property lookup on every flow file which supports expressions)
- Makes it harder for gcc/clang/etc. to inline/optimize expression language functions
- Core processors (e.g. GetFile/PutFile, where expression language will almost certainly be desired for file paths and other properties) will depend on an optional extension
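
To make the first and second cons concrete, here is a minimal sketch (names are illustrative, not the actual minifi-cpp API) of a standard evaluator interface with a NoOp fallback; when the extension is absent, every property lookup still pays a virtual call just to echo the raw value:

```cpp
#include <map>
#include <string>

// Simplified stand-in for core::FlowFile: just an attribute map.
struct FlowFile {
  std::map<std::string, std::string> attributes;
};

// The standard interface core would expose. The extension, if loaded,
// registers a real implementation; otherwise the NoOp version is used.
class ExpressionEvaluator {
 public:
  virtual ~ExpressionEvaluator() = default;
  virtual std::string evaluate(const std::string &raw,
                               const FlowFile &flow_file) = 0;
};

// Empty implementation used when the extension is not included: each
// "evaluation" is a per-lookup virtual dispatch that returns the raw
// property value unchanged.
class NoOpEvaluator : public ExpressionEvaluator {
 public:
  std::string evaluate(const std::string &raw, const FlowFile &) override {
    return raw;
  }
};
```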

I would like to hear feedback from the dev community on these two important topics (the interface to the expression language and whether or not the implementation should be an extension) before writing the code that touches core components. The API question is ultimately more important because it touches all current and future processor authors. The decision of whether it is an extension or not is more reversible.

Regards,

Andy I.C.

Re: minifi-cpp: Expression language core API

Posted by Marc <ph...@apache.org>.
Hi Andy,

   Sorry for the delayed response.

   I like the idea of moving these capabilities out of core so that we can
decouple components, for a variety of reasons that boil down to minimizing
the work to break apart components in the future. I'm certainly not against
augmenting core to facilitate features like EL, but in my opinion, we
should be able to do so without directly coupling the core to the concept
of expression language. My hope would be that an injected component (this
may necessitate a more complex framework like Google Fruit) can help
inject an expression-language-specific lambda. I think this is mostly in
line with your idea that we have a way to plug in an arbitrary evaluation
mechanism, but instead, I'm saying we should inject a function (in a
controlled way, through a custom configuration) to execute the evaluation.

Adding a FlowFile argument to getProperty indicates that the processor
developer wants to get the property in relation to the FlowFile. This
shouldn't require a separate type of property, but rather a lambda,
provided by the extension, that evaluates the property in relation to the
FlowFile's attributes and/or content. If the expression language is
disabled, replaced by another, or the property is not an expression, this
does nothing other than return the property value. In my mind you would
never have the FlowFile in onSchedule, and if you pass it with your
getProperty, the developer naturally wants evaluation to occur if that
property is an expression. If they don't want a property to ever
evaluate, they needn't pass the FlowFile pointer.
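
Roughly, what I'm picturing is something like the following sketch (all types and names here are illustrative, not actual minifi-cpp code): the core holds only an opaque std::function, and whatever is injected at configuration time determines what evaluation means.

```cpp
#include <functional>
#include <map>
#include <string>
#include <utility>

// Simplified stand-in for core::FlowFile: just an attribute map.
struct FlowFile {
  std::map<std::string, std::string> attributes;
};

// The opaque evaluator the extension injects. Core never sees EL types.
using PropertyEvaluator =
    std::function<std::string(const std::string & /*raw*/, const FlowFile &)>;

class ProcessContext {
 public:
  // The evaluator is injected at configuration time; by default none is
  // set, which models the expression language being disabled.
  explicit ProcessContext(PropertyEvaluator evaluator = {})
      : evaluator_(std::move(evaluator)) {}

  void setProperty(const std::string &name, const std::string &value) {
    properties_[name] = value;
  }

  // Static lookup: no flow file, no evaluation.
  bool getProperty(const std::string &name, std::string &value) const {
    auto it = properties_.find(name);
    if (it == properties_.end()) return false;
    value = it->second;
    return true;
  }

  // Passing a flow file signals that the caller wants evaluation. With no
  // evaluator injected, this degrades to returning the raw value.
  bool getProperty(const std::string &name, const FlowFile &flow_file,
                   std::string &value) const {
    std::string raw;
    if (!getProperty(name, raw)) return false;
    value = evaluator_ ? evaluator_(raw, flow_file) : raw;
    return true;
  }

 private:
  std::map<std::string, std::string> properties_;
  PropertyEvaluator evaluator_;
};
```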

  The cons you mention vary wildly by compiler version, processor
capabilities, etc. Branch prediction, code cache misses, virtual function
inlining, and caches are topics better served by a discussion on
optimization, not design. The code in [1] produces assembly output in
which the virtual function is inlined even without optimization (-O0) [2],
even on a portable arch setting in GCC; however, the same code produces
different assembler output on gcc 4.5.3 with optimizations enabled [3]. If
I run the optimized code, the vtable segment is approximately two
nanoseconds faster than the inlined code, because it results in smaller
code and therefore better branch prediction and instruction cache hits.
Perf stat shows this across 1M runs, and in a more illustrative example,
we can see how jumps aren't always slower than inlining. A great example
of this exists in last year's CppCon talk from Chandler Carruth. My point
is not to argue, but rather to suggest that cons such as these have such
wildly varying characteristics that it's akin to suggesting anyone really
knows what the Java garbage collector is going to do at any given point
[4].

Nonetheless, I'm suggesting we only inject and cache a function that's run
to evaluate. This certainly may require an extended FlowConfiguration and
changes to Property, ProcessorNode, and ProcessorContext, but the changes
should hopefully minimize EL specific code on the core so that if/when we
move certain parts to C for 'librarification' we can do so more easily.
Additionally, if there is ever the need to completely disable EL in
production, we can do so by changing the injected classes within the
binaries via C2.

Of course, these are only suggestions. I would eventually like to see
someone who is using libminifi as a library be able to inject this in the
same way, but not be required to include it if they have only the need to
do a very small task. For the same reason I've been slow to respond, I
unfortunately responded to this very quickly so if I didn't make any sense
please let me know [5].

[1]https://gist.github.com/phrocker/100a0806fbc188731358b659c7cf63ab

[2]

leaq -48(%rbp), %rax
addq $8, %rax
movl $.LC0, %esi
movq %rax, %rdi
call std::basic_string<char, std::char_traits<char>, std::allocator<char>
>::operator=(char const*)
leaq -48(%rbp), %rax
addq $8, %rax
movq %rax, %rdx
leaq -64(%rbp), %rax
movq %rdx, %rsi
movq %rax, %rdi
call std::basic_string<char, std::char_traits<char>, std::allocator<char>
>::basic_string(std::basic_string<char, std::char_traits<char>,
std::allocator<char> > const&)
leaq -64(%rbp), %rax
movq %rax, %rdi
call std::basic_string<char, std::char_traits<char>, std::allocator<char>
>::~basic_string()

[3]

subq $64, %rsp
leaq 16(%rsp), %rbx
movq vtable for Property2+16, 16(%rsp)

[4]
  haha.

[5]
  I usually don't.
