Posted to dev@tinkerpop.apache.org by Valentyn Kahamlyk <va...@bitquilltech.com.INVALID> on 2022/06/30 23:35:45 UTC

Design proposal to use Arrow Flight as transport for Gremlin Server

Hello Everyone,

I would like to propose exploring options to use Arrow Flight as a
transport for Gremlin Server. Currently, Gremlin Server and its clients
are based on WebSockets with a custom sub-protocol and serialization to
GraphSON and GraphBinary. Developers of each driver must implement
those protocols from scratch, and little code is reused (only
third-party WebSocket libraries are currently reused in the client
variants). Implementing the protocol is complicated and error-prone,
so most drivers support only a subset of Gremlin Server features. The
maintenance cost also grows with each new client variant added to
TinkerPop.

** Motivation **
We would like to propose a solution to reduce maintenance and simplify
the development of the client drivers by using a standard protocol
based on Apache Arrow Flight. As Arrow Flight is implemented in
the most common languages, such as C++, C#, Java and Python, we
anticipate that more existing code can be reused, which would help to
reduce maintenance costs in the future. We can also reuse other
Arrow Flight features such as authentication and error handling.

** Assumptions **
Proof of Concept Development will be done with Java 8.
Need to reuse existing code as much as possible.
It is desirable, but not necessary, to maintain compatibility with
existing drivers.
To simplify development at the initial stage, we will reuse existing
serialization mechanisms.

** Requirements **
Gremlin Server and drivers should replace the network layer with Arrow Flight.
No significant drop in performance.
Gremlin Arrow must pass the Gherkin test suite.

** Prototype Design Overview **
We would like to explore the solution below and create a prototype to
prove that the approach is feasible.
The main idea is to replace the transport layer with FlightServer and
FlightClient. They support asynchronous data transfer, splitting data
into chunks, and authorization. While Arrow Flight typically requires
a schema, in the short term we can proceed with an implementation that
uses the existing serializers and the GraphBinary format. By using
GraphBinary we will not have all the capabilities that Arrow Flight
provides out of the box, such as efficient compression. However, in the
future, we see value in adding the capability to generate a schema on
the server side, which can enable additional use cases.
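
To make the first-stage idea concrete, here is a minimal sketch in
plain Python (standard library only, no Arrow dependency). It only
models the general pattern of carrying an already-serialized response
as opaque bytes that the transport splits into chunks and the client
reassembles. The function names and the chunk size are hypothetical
and are not taken from Arrow Flight or from the prototype.

```python
from typing import Iterable, Iterator

# Hypothetical chunk size for illustration; not an Arrow Flight default.
CHUNK_SIZE = 64 * 1024


def split_into_chunks(payload: bytes, chunk_size: int = CHUNK_SIZE) -> Iterator[bytes]:
    """Yield consecutive slices of an opaque, already-serialized payload."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for offset in range(0, len(payload), chunk_size):
        yield payload[offset:offset + chunk_size]


def reassemble(chunks: Iterable[bytes]) -> bytes:
    """Concatenate chunks back into the original payload on the client side."""
    return b"".join(chunks)


if __name__ == "__main__":
    # Stand-in for bytes produced by an existing serializer.
    serialized = b"\x01\x02\x03" * 100_000
    chunks = list(split_into_chunks(serialized))
    print(len(chunks), reassemble(chunks) == serialized)
```

In this model the transport never inspects the payload, which is why
the existing serializers could stay unchanged in the first stage.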

First stage: replace transport layer, but keep serializers
Pros:
Reduction of the code base to be developed and maintained
A relatively low number of modifications

Cons:
We may observe reduced performance due to schema transfer and other
overhead. As part of the PoC we will assess performance overhead for
small and large responses and identify options to mitigate it.
Still need to support GraphBinary serialization.
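
As a rough illustration of why small responses are the concern here,
the sketch below computes the share of transferred bytes taken by a
fixed per-response overhead. The overhead figure is a made-up
placeholder, not a measurement of Arrow Flight; real numbers would
come from the PoC.

```python
def overhead_fraction(payload_bytes: int, fixed_overhead_bytes: int) -> float:
    """Return the fraction of bytes on the wire that are overhead, not payload."""
    if payload_bytes < 0 or fixed_overhead_bytes < 0:
        raise ValueError("sizes must be non-negative")
    total = payload_bytes + fixed_overhead_bytes
    return fixed_overhead_bytes / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical fixed cost per response (schema plus framing), in bytes.
    assumed_overhead = 500
    for payload in (100, 10_000, 1_000_000):
        share = overhead_fraction(payload, assumed_overhead)
        print(f"payload={payload} overhead share={share:.4f}")
```

The same fixed cost dominates a small response and becomes negligible
for a large one, so the two cases need to be measured separately.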

Second stage: replace transport layer, add dynamic schema generation
and use native Arrow structures for data transmission
Pros:
Greater reduction of the codebase to be developed and maintained
Performance can be improved for large data sets due to Arrow Flight
optimizations and the ability to transfer data in parallel
No need to support GraphBinary and GraphSON serialization protocols

Cons:
Need to rework the serialization and add schema generation
Reduced performance for small result sets
Can be complicated and expensive to generate a schema for each request
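
To show what per-request schema generation involves, here is a
simplified sketch in plain Python. It infers column names and type
names from a list of result rows. It does not use Arrow, the type
names are Python's rather than Arrow's, and the "mixed" marker is a
hypothetical way to flag columns whose values disagree.

```python
from typing import Any, Dict, List


def infer_schema(rows: List[Dict[str, Any]]) -> Dict[str, str]:
    """Map each key seen in the rows to a type name, or "mixed" on conflict."""
    schema: Dict[str, str] = {}
    for row in rows:
        for key, value in row.items():
            type_name = type(value).__name__
            previous = schema.get(key)
            if previous is None:
                schema[key] = type_name
            elif previous != type_name:
                # Heterogeneous values need special handling in a real schema.
                schema[key] = "mixed"
    return schema


if __name__ == "__main__":
    results = [
        {"name": "marko", "age": 29},
        {"name": "vadas", "age": 27},
    ]
    print(infer_schema(results))
```

Every value of every row is inspected before anything can be sent,
which is one reason the cost is noticeable for small result sets and
why heterogeneous results complicate the design.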

Please find a few more diagrams in the attached PDF file and
please share your thoughts.

Regards, Valentyn

Re: Design proposal to use Arrow Flight as transport for Gremlin Server

Posted by Valentyn Kahamlyk <va...@bitquilltech.com.INVALID>.
Hello all,

The Gremlin Arrow Flight proof of concept demo will take place in the
TinkerPop Discord channel
https://discord.gg/renSpn8K?event=1006205553749545070 on Friday, Aug
12, 10:30am PT/1:30pm ET. Feel free to join us if you are interested!
Discord registration/login may be required.



Re: Design proposal to use Arrow Flight as transport for Gremlin Server

Posted by Lyndon Bauto <lb...@aerospike.com>.
Looking forward to this, I will try to attend Aug 12.

If Arrow Flight does not work out, I think Neo4j's Bolt could be a strong
alternative to GraphBinary.

Advantages Bolt has:
- No max content length issues
- Error handling and recovery built into protocol
- Handshaking to provide compatibility across client / server wire protocol
versions
- Transactions are very well defined within Bolt

There are some others, but those come to mind. See https://7687.org/ for
more info.
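
For readers unfamiliar with the handshake mentioned above, here is a
small sketch of how version negotiation could look on the client
side. It reflects my understanding of the published Bolt
specification (a 4-byte magic preamble followed by four 4-byte
version proposals, answered by a single 4-byte version) and should be
checked against https://7687.org/ before relying on it. The function
names are hypothetical.

```python
import struct
from typing import List, Optional, Tuple

# Magic preamble as I understand it from the Bolt specification.
BOLT_MAGIC = bytes([0x60, 0x60, 0xB0, 0x17])


def build_handshake(versions: List[Tuple[int, int]]) -> bytes:
    """Build a handshake offering up to four (major, minor) versions."""
    if len(versions) > 4:
        raise ValueError("at most four versions can be proposed")
    padded = list(versions) + [(0, 0)] * (4 - len(versions))
    # Each proposal is four bytes: two zero bytes, then minor, then major.
    body = b"".join(
        struct.pack(">BBBB", 0, 0, minor, major) for major, minor in padded
    )
    return BOLT_MAGIC + body


def parse_server_choice(reply: bytes) -> Optional[Tuple[int, int]]:
    """Return the (major, minor) the server picked, or None if it picked none."""
    if len(reply) != 4:
        raise ValueError("expected exactly four bytes")
    _, _, minor, major = struct.unpack(">BBBB", reply)
    if major == 0 and minor == 0:
        return None
    return (major, minor)


if __name__ == "__main__":
    print(build_handshake([(4, 4), (4, 3)]).hex())
```

The point for this discussion is that the client and server agree on
a wire protocol version before any request is sent.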

This also fits the ideology of building bridges, not walls (
https://ieeexplore.ieee.org/document/9031506) as it aligns our wire
protocol with Neo4j's wire protocol.

Because of that, we may get good driver re-use out of this. I haven't
looked into the licensing of any of this, though, so that may not be possible.

Lyndon

On Fri, Aug 5, 2022 at 3:31 PM Valentyn Kahamlyk
<va...@bitquilltech.com.invalid> wrote:

> Hello all,
>
> I'm hosting a short demo on Discord of a proof of concept using
> Arrow Flight with Gremlin, with string queries and GraphSON for
> serialization. Any questions and comments are welcome. The next step will
> be to create the full designs based on the proof of concept. The planned
> date is Aug 12; I will follow up with the exact time later.


-- 

*Lyndon Bauto*
*Senior Software Engineer*
*Aerospike, Inc.*
www.aerospike.com
lbauto@aerospike.com

Re: Design proposal to use Arrow Flight as transport for Gremlin Server

Posted by Valentyn Kahamlyk <va...@bitquilltech.com.INVALID>.
Hello all,

I'm hosting a short demo on Discord of a proof of concept using
Arrow Flight with Gremlin, with string queries and GraphSON for
serialization. Any questions and comments are welcome. The next step will
be to create the full designs based on the proof of concept. The planned
date is Aug 12; I will follow up with the exact time later.


Re: Design proposal to use Arrow Flight as transport for Gremlin Server

Posted by Valentyn Kahamlyk <va...@bitquilltech.com.INVALID>.
Hello Joshua,

Thanks for the quick feedback!
I think we can make future support easier by removing WebSocket if Arrow
Flight does the job.

Best regards, Valentyn

On Thu, Jun 30, 2022 at 5:14 PM Joshua Shinavier <jo...@fortytwo.net> wrote:

> Hi Valentyn,
>
> Thank you for the proposal/summary. Leo Meyerovich and others have
> previously suggested adding Arrow support to TinkerPop; it just hasn't been
> prioritized. I like everything about your description apart from this
> phrase: "should replace the network layer with Arrow Flight". You are not
> suggesting that the WebSocket-based solution be removed, are you? If the
> two could exist in parallel, it definitely would be nice to have an Arrow
> option. WebSocket could perhaps be dropped later if it isn't being used
> much and/or the maintenance burden is too high. Just my $0.02.
>
> Josh

Re: Design proposal to use Arrow Flight as transport for Gremlin Server

Posted by Joshua Shinavier <jo...@fortytwo.net>.
Hi Valentyn,

Thank you for the proposal/summary. Leo Meyerovich and others have
previously suggested adding Arrow support to TinkerPop; it just hasn't been
prioritized. I like everything about your description apart from this
phrase: "should replace the network layer with Arrow Flight". You are not
suggesting that the WebSocket-based solution be removed, are you? If the
two could exist in parallel, it definitely would be nice to have an Arrow
option. WebSocket could perhaps be dropped later if it isn't being used
much and/or the maintenance burden is too high. Just my $0.02.

Josh


