Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/03/13 18:13:57 UTC

[GitHub] [arrow] GavinRay97 opened a new issue #12618: [RFC] [Java] Higher-level "DataFrame"-like API. Lower barrier to entry, increase adoption/audience and productivity.

GavinRay97 opened a new issue #12618:
URL: https://github.com/apache/arrow/issues/12618


   Based on feedback from mailing list thread here:
   - https://lists.apache.org/thread/852btc8tg5gyxglzkrmddts237fpwk8y
   
   The idea is a higher-level API wrapping `VectorSchemaRoot` and `FieldVector` that uses Java objects and a row-oriented style for familiarity.
   
   It would come with some utilities for manipulating `DataFrame`s (e.g., combining rows from multiple frames with the same schema, converting to a FlightSQL "GetTables" `Schema` object, etc.).
   
   I believe this would be tremendously valuable.
   
   Below is an example of a quickly-thrown-together rough idea, just to get the conversation started:
   - Full code available at this gist: https://gist.github.com/GavinRay97/c0434574b4516f55da1eebfd4c1519b6
   - This code is probably pretty poor and likely doesn't follow Arrow best-practices
   
   ## Example Usage
   
   ```java
   class DataFrameTest {
       public static void main(String[] args) {
           DataFrame df = DataFrame.create();
   
           df.addColumn("name", MinorType.VARCHAR, false);
           df.addColumn("age", MinorType.INT, false);
           df.addColumn("weight", MinorType.FLOAT4, false);
   
           df.addRow(Map.of("name", "Alice", "age", 21, "weight", 50.0));
           df.addRow(Map.of("name", "Bob", "age", 30, "weight", 60.0));
   
           System.out.println("======= User DataFrame -> VectorSchemaRoot (TSV) =======");
           VectorSchemaRoot root = df.toArrowVectorSchemaRoot();
           System.out.println(root.contentToTSVString());
           assert (root.getRowCount() == 2) : "Expected 2 rows";
           assert (root.getSchema().getFields().size() == 3) : "Expected 3 columns";
   
           DataFrame roundtrip = DataFrame.fromArrowVectorSchemaRoot(root);
           assert (df.equals(roundtrip)) : "DataFrame equality failed";
   
           System.out.println("======= Roundtrip (DF -> VectorSchemaRoot -> DF) =======");
           System.out.println(roundtrip + "\n");
   
           System.out.println("======= FlightSQL GetTables Schema =======");
           VectorSchemaRoot flightSchema = new FlightSQLGetTablesSchemaPOJO(
                   "catalog1", "schema1", "users", "TABLE", df)
                   .toArrowVectorSchemaRoot();
           System.out.println(flightSchema.contentToTSVString());
   
           System.out.println("======= Merge DataFrames =======");
           DataFrame df3 = DataFrame.mergeDataFrames(true, df, roundtrip);
           System.out.println(df3.toArrowVectorSchemaRoot().contentToTSVString());
           assert (df3.rows().size() == df.rows().size() + roundtrip.rows().size()) : "Merge DataFrame failed";
       }
   }
   ```
   
   ## Output
   
   ```
   ======= User DataFrame -> VectorSchemaRoot (TSV) =======
   name	age	weight
   Alice	21	50.0
   Bob	30	60.0
   
   ======= Roundtrip (DF -> VectorSchemaRoot -> DF) =======
   DataFrame[
     columns=[name: Utf8 not null, age: Int(32, true) not null, weight: FloatingPoint(SINGLE) not null],
     rows=[{name=Alice, weight=50.0, age=21}, {name=Bob, weight=60.0, age=30}]
   ]
   
   ======= FlightSQL GetTables Schema =======
   catalog_name	table_schema	db_schema_name	table_name	table_type
   catalog1	[B@4bdeaabb	schema1	users	TABLE
   
   ======= Merge DataFrames =======
   name	age	weight
   Alice	21	50.0
   Bob	30	60.0
   Alice	21	50.0
   Bob	30	60.0
   ```
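
To make the "Merge DataFrames" step above concrete, here is a minimal sketch of what combining rows from frames with the same schema could look like. It is not the code from the gist: `MergeRowsSketch` and `Frame` are hypothetical names, it uses only the JDK (no Arrow types), and the boolean flag passed to `mergeDataFrames` in the example is not modeled because the issue does not say what it controls.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of merging row-oriented frames that share a schema. */
final class MergeRowsSketch {

    /** Minimal stand-in for a row-oriented frame: ordered column names plus map-based rows. */
    static final class Frame {
        final List<String> columns;
        final List<Map<String, Object>> rows;

        Frame(List<String> columns, List<Map<String, Object>> rows) {
            this.columns = List.copyOf(columns);
            this.rows = List.copyOf(rows);
        }
    }

    /** Appends the rows of every frame in order, rejecting frames whose columns differ. */
    static Frame merge(Frame first, Frame... rest) {
        List<Map<String, Object>> merged = new ArrayList<>(first.rows);
        for (Frame frame : rest) {
            if (!frame.columns.equals(first.columns)) {
                throw new IllegalArgumentException(
                        "Schema mismatch: " + first.columns + " vs " + frame.columns);
            }
            merged.addAll(frame.rows);
        }
        return new Frame(first.columns, merged);
    }
}
```

Failing fast on a column mismatch is one possible design choice; a real implementation would also need to compare Arrow types and nullability, not just column names.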
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] GavinRay97 commented on issue #12618: [RFC] [Java] Higher-level "DataFrame"-like API. Lower barrier to entry, increase adoption/audience and productivity.

Posted by GitBox <gi...@apache.org>.

GavinRay97 commented on issue #12618:
URL: https://github.com/apache/arrow/issues/12618#issuecomment-1072431572


   FWIW, there is already a pretty solid integration with JDBC `ResultSet` objects and conversion between JDBC types and Arrow types:
   
   - https://github.com/apache/arrow/blob/09497a976604c1960c5934e8f05dd8203700efd6/java/adapter/jdbc/src/main/java/org/apache/arrow/adapter/jdbc/JdbcToArrow.java
   
   > What does a high level dataframe API bring that JDBC does not already give us?
   
   I guess maybe the easiest way to answer this is with a code sample.
   
   Say that I have an Arrow FlightSQL service, and I need to respond to `GetTables` with a list of Arrow objects describing the tables in my schema.
   
   In the FlightSQL example, this is the implementation currently:
   
   - https://github.com/apache/arrow/blob/b4143d309c71c0247c056471a27fa7f03034cc76/java/flight/flight-sql/src/test/java/org/apache/arrow/flight/sql/example/FlightSqlExample.java#L436-L530
   
   This is a personal opinion so I probably wouldn't count this as an argument, but I found this really hard to approach from an understanding perspective.
   
   But on a more realistic level -- what if you wanted to return a set of table descriptions from data that wasn't a JDBC `ResultSet`?
   
   Something like the below seems a lot easier, more approachable, and more versatile (to me), at the cost of being less performant:
   
   ```java
   record FlightSQLGetTablesSchemaPOJO(String catalogName, String schemaName, String tableName, String tableType,
                                       DataFrame dataFrame) {
       public VectorSchemaRoot toArrowVectorSchemaRoot() {
           DataFrame.Builder builder = DataFrame.builder();
           builder.addColumn("catalog_name", MinorType.VARCHAR, false);
           builder.addColumn("db_schema_name", MinorType.VARCHAR, false);
           builder.addColumn("table_name", MinorType.VARCHAR, false);
           builder.addColumn("table_type", MinorType.VARCHAR, false);
           builder.addColumn("table_schema", MinorType.VARBINARY, false);
   
           Map<String, Object> row = new HashMap<>();
           row.put("catalog_name", catalogName);
           row.put("db_schema_name", schemaName);
           row.put("table_name", tableName);
           row.put("table_type", tableType);
           row.put("table_schema", new Schema(dataFrame.columns()).toByteArray());
   
           builder.addRow(row);
           DataFrame df = builder.build();
   
           return df.toArrowVectorSchemaRoot();
       }
   }
   
   public void getStreamTables(FlightSql.CommandGetTables command, CallContext context,
           ServerStreamListener listener) {
       try {
           DataFrame userSchema = DataFrame.builder()
                   .addColumn("id", Types.MinorType.INT.getType(), false)
                   .addColumn("name", Types.MinorType.VARCHAR.getType(), false)
                   .build();
   
           DataFrame todoSchema = DataFrame.builder()
                   .addColumn("id", Types.MinorType.INT.getType(), false)
                   .addColumn("description", Types.MinorType.VARCHAR.getType(), false)
                   .addColumn("completed", Types.MinorType.BIT.getType(), false)
                   .build();
   
           FlightSQLGetTablesSchemaPOJO userTable = new FlightSQLGetTablesSchemaPOJO(
                   "catalog1", "schema1", "user", "TABLE", userSchema);
   
           FlightSQLGetTablesSchemaPOJO todoTable = new FlightSQLGetTablesSchemaPOJO(
                   "catalog1", "schema1", "todo", "TABLE", todoSchema);
   
           VectorSchemaRoot userVectorSchema = userTable.toArrowVectorSchemaRoot();
           VectorSchemaRoot todoVectorSchema = todoTable.toArrowVectorSchemaRoot();
   
           VectorSchemaRoot merged = DataFrame
                   .mergeDataFrames(true,
                           DataFrame.fromArrowVectorSchemaRoot(userVectorSchema),
                           DataFrame.fromArrowVectorSchemaRoot(todoVectorSchema))
                   .toArrowVectorSchemaRoot();
   
           listener.start(merged);
           listener.putNext();
           // Signal completion only on success; the stream should not be completed after error().
           listener.completed();
       } catch (Exception e) {
           e.printStackTrace();
           listener.error(e);
       }
   }
   ```




[GitHub] [arrow] emkornfield commented on issue #12618: [RFC] [Java] Higher-level "DataFrame"-like API. Lower barrier to entry, increase adoption/audience and productivity.

Posted by GitBox <gi...@apache.org>.

emkornfield commented on issue #12618:
URL: https://github.com/apache/arrow/issues/12618#issuecomment-1072077097


   @westonpace I think these are good ideas, but they seem slightly orthogonal to what @GavinRay97 wants the programming model to be (I do think these are useful).  In particular, if you have something that implements JDBC on top of `VectorSchemaRoot`s, it seems like you could potentially get the ORM for free with the right metadata?



[GitHub] [arrow] GavinRay97 commented on issue #12618: [RFC] [Java] Higher-level "DataFrame"-like API. Lower barrier to entry, increase adoption/audience and productivity.

Posted by GitBox <gi...@apache.org>.

GavinRay97 commented on issue #12618:
URL: https://github.com/apache/arrow/issues/12618#issuecomment-1070960313


   Hey Micah, thanks for taking the time to leave a thorough set of comments.
   
   > Would the intention be to have a mapping for all Arrow data types from a Java object? I think some of the existing `getObject` calls don't return the optimal types; would the intention be to follow those mappings when possible?
   
   Take all of my answers to this and following questions with a grain of salt (I'm deeply unfamiliar with Arrow), but -- yes, where possible.
   
   I know that some Arrow types may not map well to JVM primitives, and I'm unsure what the best thing to do there is (maybe raw bytes?). But otherwise yes, the hope is to use whatever Arrow -> JVM type mapping is the best fit. I just don't know enough about Arrow to be a good judge of what that is at the moment.
   
   > I'm hesitant to create a class named `DataFrame` in the project just for easy conversion back and forth between tuples. I think DataFrames come with a lot of expectations, and in particular the canonical memory representation here seems to be row-based on-heap objects. I would expect an implementation to use a columnar representation (and at least use the concept of Vectors for columns even if `VectorSchemaRoot` isn't used).
   
   This is fair. I had originally implemented this in my own project as `Table` since it represents row-based/tabular data, but I thought that might be too confusing. 
   
   Not sure what the best naming convention here is. But I do agree, it should be something that conveys that the data is non-columnar and there is a loss of efficiency.
   
   > I started a mailing list discussion on minimum Java version, but I believe we should be targeting at most JDK 11 for the time being.
   
   Also agreed. In this case I used `record` just for brevity's sake, to avoid boilerplate in the code.
   
   > for conversion from strings you need to pass UTF_ENCODING to avoid brittleness in conversion.
   
   Noted 👍 
   
   > I think it's worth trying to implement this in the pattern of [Loader](https://arrow.apache.org/docs/java/reference/org/apache/arrow/vector/VectorLoader.html) and [Unloader](https://arrow.apache.org/docs/java/reference/org/apache/arrow/vector/VectorUnloader.html). Maybe a new interface like VectorRowLoader and VectorRowUnloader? If the goal is to interface well with Flight, I think this might be the most ergonomic.
   
   Your judgement is better than mine -- I might need a bit of guidance on how to do this/the overall approach though.
   
   > This probably belongs in a new contrib module, but I think this would lower the barrier for entry, so if you are willing to contribute something I'd be willing to help review.
   
   Sure, I think it'd be a valuable addition and I love to contribute to OSS in an impactful way when I can.
   



[GitHub] [arrow] westonpace commented on issue #12618: [RFC] [Java] Higher-level "DataFrame"-like API. Lower barrier to entry, increase adoption/audience and productivity.

Posted by GitBox <gi...@apache.org>.

westonpace commented on issue #12618:
URL: https://github.com/apache/arrow/issues/12618#issuecomment-1072239792


   @emkornfield I think you are right that implementing JDBC would get you these APIs (and ORMs) for free.  I suppose then I'm left wondering what the goal is here.  What does a high level dataframe API bring that JDBC does not already give us?



[GitHub] [arrow] emkornfield commented on issue #12618: [RFC] [Java] Higher-level "DataFrame"-like API. Lower barrier to entry, increase adoption/audience and productivity.

Posted by GitBox <gi...@apache.org>.

emkornfield commented on issue #12618:
URL: https://github.com/apache/arrow/issues/12618#issuecomment-1072881375


   > @emkornfield I think you are right that implementing JDBC would get you these APIs (and ORMs) for free. I suppose then I'm left wondering what the goal is here. What does a high level dataframe API bring that JDBC does not already give us?
   
   @westonpace In addition to Gavin's response, I guess I'm advocating against calling it a dataframe precisely because the term is overloaded.  In my mind this is simply the difference between trying to load arbitrary tuples into the Arrow format vs. having a structured class you are going to load.  As an example, we've spent a decent amount of time making it ergonomic in Python to create Tables from Python objects, and I think the goal of this is something similar.



[GitHub] [arrow] westonpace commented on issue #12618: [RFC] [Java] Higher-level "DataFrame"-like API. Lower barrier to entry, increase adoption/audience and productivity.

Posted by GitBox <gi...@apache.org>.

westonpace commented on issue #12618:
URL: https://github.com/apache/arrow/issues/12618#issuecomment-1072905534


   I still think this is a variation on `PreparedStatement` though I agree it is more friendly.  For example, I don't see a lot of difference between:
   
   ```
       public VectorSchemaRoot toArrowVectorSchemaRoot() {
           DataFrame.Builder builder = DataFrame.builder();
           builder.addColumn("catalog_name", MinorType.VARCHAR, false);
           builder.addColumn("db_schema_name", MinorType.VARCHAR, false);
           builder.addColumn("table_name", MinorType.VARCHAR, false);
           builder.addColumn("table_type", MinorType.VARCHAR, false);
           builder.addColumn("table_schema", MinorType.VARBINARY, false);
   
           Map<String, Object> row = new HashMap<>();
           row.put("catalog_name", catalogName);
           row.put("db_schema_name", schemaName);
           row.put("table_name", tableName);
           row.put("table_type", tableType);
           row.put("table_schema", new Schema(dataFrame.columns()).toByteArray());
   
           builder.addRow(row);
           DataFrame df = builder.build();
   
           return df.toArrowVectorSchemaRoot();
       }
   ```
   
   ...and...
   
   ```
       public VectorSchemaRoot toArrowVectorSchemaRoot() throws SQLException {
           // DataFrame.Builder builder = DataFrame.builder();
           // GetArrowPreparedStatement() is hypothetical. Parameters are bound by 1-based
           // index because java.sql.PreparedStatement has no name-based setters.
           PreparedStatement ps = GetArrowPreparedStatement();
           // row.put("catalog_name", catalogName);
           ps.setString(1, catalogName);
           // row.put("db_schema_name", schemaName);
           ps.setString(2, schemaName);
           // row.put("table_name", tableName);
           ps.setString(3, tableName);
           // row.put("table_type", tableType);
           ps.setString(4, tableType);
           // row.put("table_schema", new Schema(dataFrame.columns()).toByteArray());
           ps.setBytes(5, new Schema(dataFrame.columns()).toByteArray());
           // builder.addRow(row);
           ps.addBatch();
           // return df.toArrowVectorSchemaRoot();
           // PreparedStatementToVectorSchemaRoot() is hypothetical as well.
           return PreparedStatementToVectorSchemaRoot(ps);
       }
   ```
   
   I will definitely agree that the API you've proposed uses more domain appropriate terminology, and I think that alone is worth the effort, but I still think you're solving pretty much the same problem.





[GitHub] [arrow] emkornfield commented on issue #12618: [RFC] [Java] Higher-level "DataFrame"-like API. Lower barrier to entry, increase adoption/audience and productivity.

Posted by GitBox <gi...@apache.org>.

emkornfield commented on issue #12618:
URL: https://github.com/apache/arrow/issues/12618#issuecomment-1070329458

Looking through this at a high level (I think I might have already mentioned some of this on the mailing list), here are a few comments:
0. I think having easy conversion from map-based rows to a VectorSchemaRoot is valuable. Would the intention be to have a mapping for all Arrow data types from a Java object? I think some of the existing getObject calls don't return the optimal types; would the intention be to follow those mappings when possible?
1. I'm hesitant to create a class named DataFrame in the project just for easy conversion back and forth between tuples. I think DataFrames come with a lot of expectations, and in particular the canonical memory representation here seems to be row-based on-heap objects. I would expect an implementation to use a columnar representation (and at least use the concept of Vectors for columns even if VectorSchemaRoot isn't used).
2. I started a mailing list discussion on minimum Java version, but I believe we should be targeting at most JDK 11 for the time being.
3. For conversion from strings you need to pass UTF_ENCODING to avoid brittleness in conversion.
4. I think it's worth trying to implement this in the pattern of [Loader](https://arrow.apache.org/docs/java/reference/org/apache/arrow/vector/VectorLoader.html) and [Unloader](https://arrow.apache.org/docs/java/reference/org/apache/arrow/vector/VectorUnloader.html). Maybe a new interface like VectorRowLoader and VectorRowUnloader? If the goal is to interface well with Flight, I think this might be the most ergonomic.
5. This probably belongs in a new contrib module, but I think this would lower the barrier to entry, so if you are willing to contribute something I'd be willing to help review.
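
As a rough illustration of points 0, 3, and 4 above, the sketch below transposes map-based rows into per-column value lists and encodes strings with an explicit charset. Everything here is an assumption made for illustration: `VectorRowLoaderSketch` and `load` are hypothetical names, the columns are plain Java lists rather than Arrow vectors, and `StandardCharsets.UTF_8` is one reading of what passing an explicit encoding could mean.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: transposes map-based rows into per-column value lists. */
final class VectorRowLoaderSketch {

    /**
     * Builds one value list per column, in the order given by columnNames.
     * Strings are encoded as UTF-8 bytes explicitly instead of relying on
     * the platform default charset.
     */
    static Map<String, List<Object>> load(List<String> columnNames,
                                          List<Map<String, Object>> rows) {
        Map<String, List<Object>> columns = new LinkedHashMap<>();
        for (String name : columnNames) {
            columns.put(name, new ArrayList<>());
        }
        for (Map<String, Object> row : rows) {
            for (String name : columnNames) {
                if (!row.containsKey(name)) {
                    throw new IllegalArgumentException("Row is missing column: " + name);
                }
                Object value = row.get(name);
                if (value instanceof String) {
                    value = ((String) value).getBytes(StandardCharsets.UTF_8);
                }
                columns.get(name).add(value);
            }
        }
        return columns;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> rows = List.of(
                Map.<String, Object>of("name", "Alice", "age", 21),
                Map.<String, Object>of("name", "Bob", "age", 30));
        Map<String, List<Object>> columns = load(List.of("name", "age"), rows);
        System.out.println(columns.get("age")); // prints [21, 30]
    }
}
```

A real loader would write each column into the matching Arrow vector instead of a list, but the row-to-column transposition and the explicit charset would look much the same.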


[GitHub] [arrow] westonpace commented on issue #12618: [RFC] [Java] Higher-level "DataFrame"-like API. Lower barrier to entry, increase adoption/audience and productivity.

Posted by GitBox <gi...@apache.org>.

westonpace commented on issue #12618:
URL: https://github.com/apache/arrow/issues/12618#issuecomment-1071331036


   In the spirit of brainstorming, JDBC has some similar interfaces (not that anyone I've ever met particularly loves them).  For creating data there is `java.sql.PreparedStatement` (although bulk inserts are less common in the OLTP world):
   
   ```
       // Parameters are bound by 1-based index; PreparedStatement has no name-based setters.
       PreparedStatement ps = c.prepareStatement("INSERT INTO people (name, age, weight) VALUES (?, ?, ?)");
       ps.setString(1, "Alice");
       ps.setInt(2, 21);
       ps.setDouble(3, 50.0);
       ps.addBatch();
       ps.clearParameters();
       ps.setString(1, "Bob");
       ps.setInt(2, 30);
       ps.setDouble(3, 60.0);
       ps.addBatch();
       // Could presumably go from ps to VectorSchemaRoot somehow
   ```
   
   For reading data there is `java.sql.ResultSet`:
   
   ```
       ResultSet rs = ...; // From VectorSchemaRoot
       rs.next(); // The cursor starts before the first row
       var name1 = rs.getString("name");
       var age1 = rs.getInt("age");
       var weight1 = rs.getDouble("weight");
       rs.next();
       var name2 = rs.getString("name");
       var age2 = rs.getInt("age");
       var weight2 = rs.getDouble("weight");
   ```
   
   Even if you don't use these directly, it might be valuable to have compatibility with them, especially `java.sql.ResultSet`.  It's also an interface, so I wonder if you could have a `java.sql.ResultSet` that is a zero-copy view of Arrow data.
   
   Although I believe JPA / Hibernate style APIs are probably more popular these days:
   
   ```
       List<People> people = GetPeople();
       VectorSchemaRoot root = VectorSchemaRootFromObjects(people);
   ```
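
One way the object-based idea above could work is reflection over the fields of a plain class. The sketch below is only an illustration under stated assumptions: `ObjectsToColumnsSketch`, `toColumns`, and the `Person` class are hypothetical names, and the result is a map of plain Java lists rather than a `VectorSchemaRoot`.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: derives columns from the instance fields of a class. */
final class ObjectsToColumnsSketch {

    /** Example class; its name and fields are made up for illustration. */
    static final class Person {
        final String name;
        final int age;

        Person(String name, int age) {
            this.name = name;
            this.age = age;
        }
    }

    /**
     * Returns one value list per instance field of the given type.
     * Column order follows getDeclaredFields(), which the JDK does not guarantee.
     */
    static <T> Map<String, List<Object>> toColumns(Class<T> type, List<T> objects) {
        Map<String, List<Object>> columns = new LinkedHashMap<>();
        try {
            for (Field field : type.getDeclaredFields()) {
                if (Modifier.isStatic(field.getModifiers()) || field.isSynthetic()) {
                    continue; // Skip constants and compiler-generated fields
                }
                field.setAccessible(true);
                List<Object> values = new ArrayList<>();
                for (T object : objects) {
                    values.add(field.get(object)); // Primitives are boxed
                }
                columns.put(field.getName(), values);
            }
        } catch (IllegalAccessException e) {
            throw new IllegalStateException("Could not read field reflectively", e);
        }
        return columns;
    }
}
```

A real version would also need to map each field's Java type to an Arrow type, which is where the type-mapping questions raised elsewhere in this thread come in.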

