Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/07/13 16:28:07 UTC

[GitHub] [arrow] lwhite1 commented on issue #12618: [RFC] [Java] Higher-level "DataFrame"-like API. Lower barrier to entry, increase adoption/audience and productivity.

lwhite1 commented on issue #12618:
URL: https://github.com/apache/arrow/issues/12618#issuecomment-1183432653

   I think this map-based approach is wonderfully friendly for Java devs, as is the JDBC-like syntax. I will just mention here a method that may be more GC-friendly. Apologies if this takes the conversation in an unhelpful direction.
   
   In the Tablesaw dataframe, row-oriented access is provided by an object called a “Row” (although it should probably have been called a Cursor). The intent was to minimize the memory overhead of row-based access, since instantiating real rows would cause memory to grow by some multiple of the original dataframe size. The read operations look like this:
   
   ```
     Table t = ....;
     for (Row row: t) {
        int age = row.getInt("age");          // no boxing
        String name = row.getString("name");  // retrieve from dictionary encoding
        // do whatever else you want to do.
     } 
   
   ```
   You can also access a row by index:
   
   ```
      Row r = t.row(43); 
   ```
   and move the index programmatically, if needed. 
   
   The advantages are that:
   - the row object is created only once; it just gets its index updated as it moves,
   - rows don't have a column name attribute for each column,
   - the primitive values are accessed without boxing,
   - primitive-encoded values like LocalDate can be retrieved either as LocalDate objects or as encoded primitives, if you don't need the whole object,
   - you can combine iteration with filtering and postpone/avoid instantiation of some objects until they're needed.
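   
   To make the cursor idea concrete, here is a minimal sketch in plain Java. It is not Tablesaw's implementation: `MiniTable`, `RowCursor`, `demoTable`, and `sumAdultAges` are hypothetical names, and storing dates as epoch days in an int column is an assumed encoding chosen only to keep the example short.
   
   ```java
   import java.time.LocalDate;
   import java.util.HashMap;
   import java.util.Iterator;
   import java.util.Map;
   import java.util.NoSuchElementException;

   // Hypothetical sketch of a cursor-style row; these are not Tablesaw's classes.
   class CursorSketch {

       /** Column-oriented storage: one array per column, keyed by column name. */
       static final class MiniTable implements Iterable<RowCursor> {
           final Map<String, int[]> intColumns = new HashMap<>();
           final Map<String, String[]> stringColumns = new HashMap<>();
           final int rowCount;

           MiniTable(int rowCount) { this.rowCount = rowCount; }

           @Override
           public Iterator<RowCursor> iterator() {
               final RowCursor cursor = new RowCursor(this); // allocated once per loop
               return new Iterator<RowCursor>() {
                   private int position = 0;
                   @Override public boolean hasNext() { return position < rowCount; }
                   @Override public RowCursor next() {
                       if (!hasNext()) throw new NoSuchElementException();
                       return cursor.at(position++); // same object, new index
                   }
               };
           }
       }

       /** A view onto one row: it holds an index, not a copy of the row's values. */
       static final class RowCursor {
           private final MiniTable table;
           private int index;

           RowCursor(MiniTable table) { this.table = table; }

           RowCursor at(int rowIndex) { this.index = rowIndex; return this; }

           // Reads the primitive straight from the column array, so nothing is boxed.
           int getInt(String column) { return table.intColumns.get(column)[index]; }

           String getString(String column) { return table.stringColumns.get(column)[index]; }

           // Assumed encoding for this sketch: dates stored as epoch days in an int column.
           int getPackedDate(String column) { return getInt(column); }

           LocalDate getDate(String column) { return LocalDate.ofEpochDay(getInt(column)); }
       }

       static MiniTable demoTable() {
           MiniTable t = new MiniTable(3);
           t.intColumns.put("age", new int[] {17, 34, 44});
           t.stringColumns.put("name", new String[] {"Ann", "Bob", "Cy"});
           t.intColumns.put("joined", new int[] {0, 365, 18000});
           return t;
       }

       /** Filters while iterating; no per-row object is created. */
       static int sumAdultAges(MiniTable t) {
           int total = 0;
           for (RowCursor row : t) {
               int age = row.getInt("age");
               if (age >= 18) total += age;
           }
           return total;
       }
   }
   ```
   
   In this sketch the whole loop allocates one `RowCursor`, and a `LocalDate` is only built when `getDate` is called; `getPackedDate` returns the stored primitive instead.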
   
   You can also update using the API: 
   ```
      for (Row row: t) {
         int age = row.getInt("age");
         if (age >= 18) {
            row.setBoolean("adult", true);
         }
      } 
   ```
   
   Row-based inserts are performed using the same API by asking the table for a new row:
   ```
      Row r = t.appendRow();     // adds a new 'cell' to every column in the table, and returns the row pointing to those cells
      r.setString("name", "Joe"); 
   ```
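   
   As a rough illustration of what "adds a new cell to every column" can mean, here is a self-contained sketch. The names `AppendSketch`, `MiniTable`, `Cursor`, `rowCount`, and `get` are hypothetical, and the copy-on-every-append growth is a simplification for readability, not how Tablesaw stores columns.
   
   ```java
   import java.util.Arrays;
   import java.util.LinkedHashMap;
   import java.util.Map;

   // Hypothetical sketch of cursor-based append; not Tablesaw's implementation.
   class AppendSketch {

       static final class MiniTable {
           private final Map<String, String[]> columns = new LinkedHashMap<>();
           private int rowCount = 0;

           MiniTable(String... columnNames) {
               for (String name : columnNames) columns.put(name, new String[0]);
           }

           /** Grows every column by one cell and returns a cursor on the new row. */
           Cursor appendRow() {
               // Copying on each append is O(n); a real column would grow in chunks.
               columns.replaceAll((name, cells) -> Arrays.copyOf(cells, rowCount + 1));
               return new Cursor(this, rowCount++);
           }

           int rowCount() { return rowCount; }

           String get(String column, int row) { return columns.get(column)[row]; }
       }

       static final class Cursor {
           private final MiniTable table;
           private final int index;

           Cursor(MiniTable table, int index) {
               this.table = table;
               this.index = index;
           }

           // Writes go straight into the column's backing array.
           void setString(String column, String value) {
               table.columns.get(column)[index] = value;
           }
       }
   }
   ```
   
   With this sketch, `new AppendSketch.MiniTable("name", "city").appendRow().setString("name", "Joe")` leaves the table with one row whose "city" cell is still unset (null here).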
   
   The main disadvantages I see are that:
   - the "Row" object cannot be safely passed around like a map; you need to use it in one thread and extract whatever data you need there.
   - the operations are not as obvious to new users as a method based on returning maps.
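   
   The first disadvantage follows from the cursor being a single mutable object. A small sketch of the pitfall, using a hypothetical `NameCursor` rather than any Tablesaw class:
   
   ```java
   import java.util.ArrayList;
   import java.util.List;

   // Hypothetical sketch of the aliasing hazard; not Tablesaw code.
   class CursorAliasingSketch {

       static final class NameCursor {
           private final String[] names;
           private int index;

           NameCursor(String[] names) { this.names = names; }

           NameCursor at(int rowIndex) { this.index = rowIndex; return this; }

           String getName() { return names[index]; }
       }

       /** Buggy: every list element is the same cursor, left on the last row. */
       static List<String> retainCursors(String[] names) {
           NameCursor cursor = new NameCursor(names);
           List<NameCursor> kept = new ArrayList<>();
           for (int i = 0; i < names.length; i++) {
               kept.add(cursor.at(i));
           }
           List<String> seen = new ArrayList<>();
           for (NameCursor c : kept) {
               seen.add(c.getName());
           }
           return seen;
       }

       /** Safe: copy the values out while the cursor is on the row. */
       static List<String> copyValues(String[] names) {
           NameCursor cursor = new NameCursor(names);
           List<String> copied = new ArrayList<>();
           for (int i = 0; i < names.length; i++) {
               copied.add(cursor.at(i).getName());
           }
           return copied;
       }
   }
   ```
   
   For the names Ann, Bob, Cy, `retainCursors` yields Cy three times, while `copyValues` yields the three names in order. The same shared index is why handing a cursor to another thread is unsafe without external synchronization.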
   
   

