Posted to dev@arrow.apache.org by Paul Taylor <pt...@apache.org> on 2018/05/20 03:35:53 UTC

Proposed Arrow Graph representations

At GTC San Jose last month, NVIDIA's Joe Eaton (cc'd) presented on the
nvGraph <https://developer.nvidia.com/nvgraph> team's goals for
accelerating in-memory graph processing and analytics. A major component of
that is advancing and standardizing a common, efficient representation for
graphs that can support a broad range of use cases, from small to large.

To that end, I'd like to kick off the discussion about native graph
representations in Arrow.

Joe's team has prepared a preliminary FlatBuffers schema for efficient
columnar representations of the four most common graph formats. It includes
embedded edge and vertex property tables, and is designed to be compatible
with the existing Arrow column types. My initial thoughts are that we could
add an optional 5th Graph Message type, similar to how Tensor Messages are
presently implemented.
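
For readers less familiar with columnar graph layouts, here is a minimal
sketch of one widely used format, compressed sparse row (CSR), built from
an edge list. This message does not enumerate the four formats in the
schema, so the choice of CSR is an assumption, and the function and
variable names are illustrative rather than taken from GraphSchema.fbs.

```python
def edges_to_csr(num_vertices, edges):
    """Convert (src, dst) pairs into CSR offsets and destination indices.

    Hypothetical helper for illustration; not part of Arrow or nvGraph.
    """
    # Count the outgoing edges of each vertex.
    counts = [0] * num_vertices
    for src, _ in edges:
        counts[src] += 1

    # A prefix sum over the counts gives each vertex's slice boundaries.
    offsets = [0] * (num_vertices + 1)
    for v in range(num_vertices):
        offsets[v + 1] = offsets[v] + counts[v]

    # Scatter each destination into the slice owned by its source vertex.
    cursor = offsets[:-1].copy()
    indices = [0] * len(edges)
    for src, dst in edges:
        indices[cursor[src]] = dst
        cursor[src] += 1
    return offsets, indices


def neighbors(offsets, indices, v):
    """Return the destinations of all edges leaving vertex v."""
    return indices[offsets[v]:offsets[v + 1]]


edges = [(0, 1), (0, 2), (1, 2), (3, 0)]
offsets, indices = edges_to_csr(4, edges)
```

Both outputs are flat integer arrays, which is what makes a layout like
this a plausible fit for Arrow-style columns; edge properties could sit in
further columns aligned with `indices`.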

I've pushed Joe's initial GraphSchema.fbs to this branch on my Arrow fork
<https://github.com/trxcllnt/arrow/blob/78f6b6c6a5b9e4e7bf96f5bbc4dfed7528b1cca7/format/GraphSchema_Triples_Quads.fbs>.
From what I understand, the tables have been expanded into separate
definitions for the sake of comprehension, and the final forms will be
collapsed into each distinct Graph type, parameterized by sizes defined at
the top.

I also understand the nvGraph team supports these layouts natively,
enabling the community to take advantage of high-performance GPU kernels
very early on, and possibly align with libraries like Hornet
<https://github.com/hornet-gt/hornetsnest> (previously cuStinger).

Cheers,
Paul

Re: Proposed Arrow Graph representations

Posted by Wes McKinney <we...@gmail.com>.

hi folks,

I have glanced at the Flatbuffers file with the proposed graph
schemas. IP / licensing problems aside, I don't know enough about
graph representations to have the context to judge whether this is the
correct approach.

My initial reaction is that the file is very long and has few comments
to help a reader understand the details. The intent of the metadata we
have so far (i.e. Schema.fbs, etc.) is to describe record batch schemas
and to provide a "data header" describing the locations of memory
blocks in each type of message. It is not the intent that the
Flatbuffers contain actual data, just the metadata that enables memory
blocks to be interpreted correctly.
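
To make the metadata / data split concrete, here is a rough sketch of the
idea. It is not the Arrow IPC format: a plain dict stands in for the
Flatbuffers metadata, and the field names ("buffers", "offset", "length")
and helper functions are assumptions made up for this illustration.

```python
import struct


def pack_body(columns):
    """Pack int32 columns into one body; return (header, body).

    The header records only where each buffer lives, never its values.
    """
    header = {"buffers": []}
    body = bytearray()
    for name, values in columns.items():
        raw = struct.pack(f"<{len(values)}i", *values)
        header["buffers"].append(
            {"name": name, "offset": len(body), "length": len(raw)}
        )
        body += raw
        # Pad so that the next buffer starts on an 8-byte boundary.
        body += b"\x00" * (-len(body) % 8)
    return header, bytes(body)


def read_column(header, body, name):
    """Use the header to locate one buffer and decode it as int32."""
    for buf in header["buffers"]:
        if buf["name"] == name:
            start, length = buf["offset"], buf["length"]
            view = memoryview(body)[start:start + length]
            return list(struct.unpack(f"<{length // 4}i", view))
    raise KeyError(name)


header, body = pack_body({"src": [0, 0, 1, 3], "dst": [1, 2, 2, 0]})
dst = read_column(header, body, "dst")
```

The header stays small no matter how many edges the body holds, which is
the property the existing metadata files are meant to preserve.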

Maybe the best way forward would be to write some documentation
providing a comprehensive description of the serialization / data
access paradigm: start with some example graph data, then show how it
is converted to the Arrow-based graph representation. What are the
scalability characteristics / limitations (e.g. a single piece of
metadata cannot exceed 2GB; does that cause problems)? Are there other
tradeoffs to be aware of?

Thanks,
Wes

On Mon, May 21, 2018 at 2:14 PM, Wes McKinney <we...@gmail.com> wrote:
> hi Josh,
>
> Yes, the standard process for importing externally-developed code is
> the Incubator IP clearance: http://incubator.apache.org/ip-clearance/.
> As an example, we recently received a Go codebase donation from
> InfluxData where there was a combination of ICLAs from the
> contributors and a software grant agreement:
> http://incubator.apache.org/ip-clearance/arrow-go-library.html. We did
> this for Plasma, too.
>
> Needless to say, whenever new work can be done within Apache Arrow
> and through the community process, that spares the PMC a lot of work
> and IP / licensing review by avoiding the IP clearance process.
>
> - Wes

Re: Proposed Arrow Graph representations

Posted by Wes McKinney <we...@gmail.com>.

hi Josh,

Yes, the standard process for importing externally-developed code is
the Incubator IP clearance: http://incubator.apache.org/ip-clearance/.
As an example, we recently received a Go codebase donation from
InfluxData where there was a combination of ICLAs from the
contributors and a software grant agreement:
http://incubator.apache.org/ip-clearance/arrow-go-library.html. We did
this for Plasma, too.

Needless to say, whenever new work can be done within Apache Arrow
and through the community process, that spares the PMC a lot of work
and IP / licensing review by avoiding the IP clearance process.

- Wes

On Mon, May 21, 2018 at 1:41 PM, Joshua Patterson <jo...@nvidia.com> wrote:
> Hi Wes,
> I'm sure we're going to run into this with libgdf/pygdf as well.  Is there a systematic way we could do a transfer of IP?

Re: Proposed Arrow Graph representations

Posted by Joshua Patterson <jo...@nvidia.com>.

Hi Wes,
I'm sure we're going to run into this with libgdf/pygdf as well.  Is there a systematic way we could do a transfer of IP?

On 5/20/18, 7:05 PM, "Wes McKinney" <we...@gmail.com> wrote:

    hi Paul,
    
    This is a great discussion to get started. I will review the patch in
    some more detail and send feedback.
    
    > I've pushed Joe's initial GraphSchema.fbs to this branch on my Arrow fork
    
    I'm concerned that the way this patch is set up right now is a little
    bit problematic from an IP lineage standpoint (since this is
    his/NVIDIA's code and not yours). Would it be possible for Joe to
    create a pull request directly for this instead? We can create a
    branch somewhere where we can collaborate, too, if that helps.
    
    Thanks,
    Wes
    


-----------------------------------------------------------------------------------
This email message is for the sole use of the intended recipient(s) and may contain
confidential information.  Any unauthorized review, use, disclosure or distribution
is prohibited.  If you are not the intended recipient, please contact the sender by
reply email and destroy all copies of the original message.
-----------------------------------------------------------------------------------

Re: Proposed Arrow Graph representations

Posted by Wes McKinney <we...@gmail.com>.

hi Paul,

This is a great discussion to get started. I will review the patch in
some more detail and send feedback.

> I've pushed Joe's initial GraphSchema.fbs to this branch on my Arrow fork

I'm concerned that the way this patch is set up right now is a little
bit problematic from an IP lineage standpoint (since this is
his/NVIDIA's code and not yours). Would it be possible for Joe to
create a pull request directly for this instead? We can create a
branch somewhere where we can collaborate, too, if that helps.

Thanks,
Wes
