Posted to dev@tinkerpop.apache.org by Alaa Mahmoud <al...@gmail.com> on 2015/12/01 16:46:48 UTC

[DISCUSS] Add native CSV loading support for gremlin (GraphReader)

Adding support for loading CSV into a graph using Gremlin's GraphReader
will lower the entry barrier for new users. A lot of data is already in CSV
format and a lot of existing databases/repositories allow users to export
their data as CSV.

I'd like to add this capability to the gremlin core as a new GraphReader
instance. Since the CSV data doesn't map directly to vertices and edges,
I'm planning to do the loading in two steps:

*Nodes*
The first step is to load a vertex CSV file. I'll create a vertex for every
line in the CSV and a property for each column on that line. If the CSV has
column headers, then the column names will become the names of the
corresponding vertex properties. Otherwise, they'll be prop1, prop2, etc.
(There are other ways to do it as well, but I'm just trying to show the
general idea.)
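
As a rough sketch, the first step could look something like the following in
Gremlin Console Groovy (the file name, the naive comma splitting and the use
of TinkerGraph are placeholders, and quoting/escaping is ignored entirely):

    import org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerGraph

    graph = TinkerGraph.open()
    lines = new File('vertices.csv').readLines()
    header = lines.head().split(',')              // or prop1, prop2, ... when there is no header row
    lines.tail().each { line ->
        cols = line.split(',')
        keyValues = []
        header.eachWithIndex { name, i -> keyValues << name << cols[i] }
        graph.addVertex(keyValues as Object[])    // one vertex per CSV line, one property per column
    }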

*Edges*
The second step is loading the edge CSV file, which will be in the
following format:

vertex1 prop name (source vertex), vertex2 prop name (destination vertex),
bidirectional (TRUE/FALSE), prop1,prop2,prop3,etc...

For each line in the edge CSV file, the reader will search for a vertex
with the vertex1 prop value (the caller needs to ensure it's unique) to find
the source vertex, search for a destination vertex with the destination prop
value, and then create an edge that ties the two together. We will create an
edge property for each additional column on the line.
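
Continuing the same rough sketch for the edge step (the 'link' edge label and
the way the bidirectional flag is handled are assumptions on my part, since
the format above doesn't spell those out):

    g = graph.traversal()
    edgeLines = new File('edges.csv').readLines()
    edgeHeader = edgeLines.head().split(',')         // vertex1 prop name, vertex2 prop name, bidirectional, props...
    edgeLines.tail().each { line ->
        cols = line.split(',')
        src = g.V().has(edgeHeader[0], cols[0]).next()   // caller must keep this prop value unique
        dst = g.V().has(edgeHeader[1], cols[1]).next()
        props = []
        cols.drop(3).eachWithIndex { val, i -> props << edgeHeader[i + 3] << val }
        src.addEdge('link', dst, props as Object[])
        if (cols[2].equalsIgnoreCase('TRUE'))
            dst.addEdge('link', src, props as Object[])  // bidirectional: also add the reverse edge
    }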

Thoughts?

Alaa

Re: [DISCUSS] Add native CSV loading support for gremlin (GraphReader)

Posted by Stephen Mallette <sp...@gmail.com>.
Dylan - thanks for your input. What you said actually gets at the direction
I was heading when I asked about how "types" would be handled, and it
underscores what I perceive to be a greater level of complexity for this task
than is handled well by the standard GraphReader/Writer interfaces.


Re: [DISCUSS] Add native CSV loading support for gremlin (GraphReader)

Posted by Alaa Mahmoud <al...@gmail.com>.
My intent is to add a new instance of GraphReader (GraphCSVReader) to read
CSV files, but if the community feels it should be a third-party IO library
then that's fine as well.
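
For context, such a GraphCSVReader would plug into the existing GraphReader
interface; a bare-bones skeleton (left abstract here only so the sketch can
skip the interface's other read* methods) might start out like this:

    import org.apache.tinkerpop.gremlin.structure.Graph
    import org.apache.tinkerpop.gremlin.structure.io.GraphReader

    abstract class GraphCSVReader implements GraphReader {
        @Override
        void readGraph(InputStream inputStream, Graph graphToWriteTo) throws IOException {
            // parse the vertex/edge CSV from the stream and add the elements to graphToWriteTo
        }
        // readVertex, readEdge and the interface's other read* methods are omitted here
    }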

Alaa


Re: [DISCUSS] Add native CSV loading support for gremlin (GraphReader)

Posted by Stephen Mallette <sp...@gmail.com>.
>
> How about we create an enhanced version of GraphReader that takes a schema
> and parser parameters (separator character, column headers or not,
> encoding, etc.)?


I'd be against making that kind of change as schema is not a first-class
citizen in TinkerPop.  We haven't yet made the leap to include it, and I'd say
any proposal to deal with schema in IO would need to be considered in the much
broader terms of the whole TinkerPop ecosystem.  You can search the lists for
the various discussions that have been had on that if you are interested.

On a separate note, I'm not sure if you've explained what your intent is,
but my personal opinion is that you should develop this capability in your
own repo and offer it to the community as a third-party IO library. Not
sure how others feel about it.


Re: [DISCUSS] Add native CSV loading support for gremlin (GraphReader)

Posted by Alaa Mahmoud <al...@gmail.com>.
Thanks Stephen and Dylan for your responses. I was trying to work within
what GraphReader currently offers, which isn't ideal. Ideally GraphReader
would allow passing a schema for the file being read rather than having each
instance figure out a way to do that. The less we require users to modify
their data before using TP, the more successful we'll be.

How about we create an enhanced version of GraphReader that takes a schema
and parser parameters (separator character, column headers or not,
encoding, etc.)?

Dylan, thanks for sharing how your tool works; it has the same essence
as what I'm trying to do. I like the idea of having the types in the column
header next to each column name.

We can start with something simple and then enhance it as we go.

Regards


Re: [DISCUSS] Add native CSV loading support for gremlin (GraphReader)

Posted by Dylan Bethune-Waddell <dy...@mail.utoronto.ca>.
I wrote a command line utility in Groovy that would do this for Titan -
here's how it worked:

1) Either a file or directory path for vertices/edges was passed.
2) Optional regex for extracting the vertex label from the file name(s).
    - Default is to split on underscores/dash/whitespace and take 
      element [0] (the label in the file would give more flexibility).
    - These files are batched according to available processors.
    - A transaction was opened to load each file from each batch.
3) Vertices - 1st column as id property, remaining additional props.
    - Should just be selection of the desired named/positional column.
    - The user should be able to provide an id mapping file:
       a. Restricts ids they care to load by the mapped-to ids.
       b. Shows coverage of their intended id conversion over file lines.
4) Edges - 1st column v1 id, 2nd label, 3rd v2 id, rest edge props.
    - Should also be generalized to selection/configuration by user.
5) Type - append after colon to the column header e.g. "name:int" (see the sketch below).
    - Type is often inferred from the first hundred lines of the file.
    - But when inconsistencies are further along than that, ugh.
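
To illustrate the "name:int" header convention in (5) - the helper and the
handful of supported types below are made up for the example, not taken from
the actual utility:

    // Hypothetical helper: turn a "name:string,age:int" header row into a map
    // of column name -> coercion closure (columns without a type stay Strings).
    def parseTypedHeader = { String headerLine ->
        headerLine.split(',').toList().collectEntries { col ->
            def parts = col.trim().split(':')
            def coerce
            switch (parts.size() > 1 ? parts[1].toLowerCase() : 'string') {
                case 'int':    coerce = { it.toInteger() }; break
                case 'long':   coerce = { it.toLong() };    break
                case 'double': coerce = { it.toDouble() };  break
                case 'bool':   coerce = { it.toBoolean() }; break
                default:       coerce = { it }
            }
            [(parts[0]): coerce]
        }
    }

    assert parseTypedHeader('name:string,age:int').age('42') == 42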

I never managed to get the "interactive" part working before I moved 
on from this, but I think it's essential as the user should not have to 
hack on the CSV data much to get it to load. My idea was displaying
the file headers, getting the user to mark which has the "identifier" 
(for Titan that was just a property key under a unique index), asking them
if they have a map file for that identifier, and finally asking them to 
confirm the types we inferred based on the first 100 lines or a sampling 
of lines or whatever with an option to "just do it already". Then, if the
user is trying to load a gazillion CSV files from a directory or set of
directories, we just ask them for "profiles" like this to apply per directory,
per file name matching some regex or criteria about its n x m column
shape, or something else to distinguish multiple files from each other.
Same general thing applies to edges. Of course, all this should be
possible to tuck away in a configuration file, or provide as arguments
to a "builder" in the REPL somehow - I think that could get confusing
fast, but with similar hand-holding to the above it could be workable.

For parsing the file, I think it needs reasonable defaults but, like most
CSV parsing frameworks, should provide the option to change the quote
character, line terminator, delimiter, skip n lines at the front and n lines
at the back, and all that stuff.
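
For what it's worth, a library such as Apache Commons CSV already exposes
most of those knobs, so the parsing itself wouldn't have to be hand-rolled.
A minimal sketch, assuming commons-csv is on the classpath (skipping leading
or trailing lines would still be manual):

    import org.apache.commons.csv.CSVFormat
    import org.apache.commons.csv.CSVParser
    import java.nio.charset.StandardCharsets

    def format = CSVFormat.DEFAULT.
            withDelimiter('\t' as char).      // delimiter
            withQuote('"' as char).           // quote character
            withRecordSeparator('\n').        // line terminator
            withFirstRecordAsHeader()         // use line 1 as the column names
    CSVParser.parse(new File('vertices.csv'), StandardCharsets.UTF_8, format).each { record ->
        println record.toMap()                // column name -> value for one line
    }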

Hope that helps somewhat - sorry for the spam if this could have gone
unsaid.


Re: [DISCUSS] Add native CSV loading support for gremlin (GraphReader)

Posted by Stephen Mallette <sp...@gmail.com>.
Thanks for bringing this up for discussion and offering to work on it. You
don't make mention of how you will deal with data types - will you have
some way to give users some fine-grained control of that?




