Posted to jira@arrow.apache.org by "Wenbo Hu (Jira)" <ji...@apache.org> on 2021/11/02 07:49:00 UTC

[jira] [Updated] (ARROW-14549) VectorSchemaRoot is not refreshed when value is null

     [ https://issues.apache.org/jira/browse/ARROW-14549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenbo Hu updated ARROW-14549:
-----------------------------
    Description: 
I'm using `arrow-jdbc` to convert query results from JDBC to Arrow.
 But with the following code, unexpected behavior happens.

Assume a SQLite db where col_2 and col_3 of the 2nd row are NULL (a setup sketch follows the table):
|col_1|col_2|col_3|
|-------|--------|--------|
|1|abc|3.14|
|2|NULL|NULL|
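
For reference, this test data can be recreated with plain JDBC; a minimal sketch (assuming the xerial sqlite-jdbc driver; the table name `test_db` matches the query used further below, the file name is arbitrary):
{code:java}
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreateTestData {
  public static void main(String[] args) throws Exception {
    // Create the table from the description: row 1 fully populated, row 2 with NULLs.
    try (Connection conn = DriverManager.getConnection("jdbc:sqlite:test.db");
         Statement stmt = conn.createStatement()) {
      stmt.executeUpdate("CREATE TABLE test_db (col_1 INTEGER, col_2 TEXT, col_3 REAL)");
      stmt.executeUpdate("INSERT INTO test_db VALUES (1, 'abc', 3.14)");
      stmt.executeUpdate("INSERT INTO test_db VALUES (2, NULL, NULL)");
    }
  }
}
{code}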

As the documentation suggests,
{quote}populated data over and over into the same VectorSchemaRoot in a stream of batches rather than creating a new VectorSchemaRoot instance each time.
{quote}
*JdbcToArrowConfig* is set to reuse the root.
{code:java}
public void querySql(String query, QueryOption option) throws Exception {
  try (final java.sql.Connection conn = connectContainer.getConnection();
       final Statement stmt = conn.createStatement();
       final ResultSet rs = stmt.executeQuery(query)) {
    // create config with reused schema root and custom batch size from option
    final JdbcToArrowConfig config = new JdbcToArrowConfigBuilder()
        .setAllocator(new RootAllocator())
        .setCalendar(JdbcToArrowUtils.getUtcCalendar())
        .setTargetBatchSize(option.getBatchSize())
        .setReuseVectorSchemaRoot(true)
        .build();

    final ArrowVectorIterator iterator = JdbcToArrow.sqlToArrowVectorIterator(rs, config);
    while (iterator.hasNext()) {
      // retrieve result from iterator
      final VectorSchemaRoot root = iterator.next();
      option.getCallback().handleBatchResult(root);
      root.allocateNew(); // the root has to be re-allocated here, otherwise stale values remain
    }
  } catch (java.lang.Exception e) {
    throw new Exception(e.getMessage());
  }
}

......
// batch_size is set to 1, so the callback is called twice.
QueryOption options = new QueryOption(1, root -> {
  // if the printer is not set yet, get the schema and write the header
  if (printer == null) {
    final String[] headers = root.getSchema().getFields().stream()
        .map(Field::getName).toArray(String[]::new);
    printer = new CSVPrinter(writer,
        CSVFormat.Builder.create(CSVFormat.DEFAULT).setHeader(headers).build());
  }

  final int rows = root.getRowCount();
  final List<FieldVector> fieldVectors = root.getFieldVectors();

  // iterate over rows
  for (int i = 0; i < rows; i++) {
    final int rowId = i;
    final List<String> row = fieldVectors.stream()
        .map(v -> v.getObject(rowId))
        .map(String::valueOf)
        .collect(Collectors.toList());
    printer.printRecord(row);
  }
});

connection.querySql("SELECT * FROM test_db", options);
......
{code}
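
The `QueryOption` type above is just a small holder for the batch size and the per-batch callback; a simplified sketch of the shape assumed here (the actual class in my project may differ slightly):
{code:java}
import org.apache.arrow.vector.VectorSchemaRoot;

// Simplified sketch of the QueryOption holder used in the snippets above:
// it carries the target batch size and the callback invoked for every batch.
public class QueryOption {

  public interface BatchCallback {
    void handleBatchResult(VectorSchemaRoot root) throws Exception;
  }

  private final int batchSize;
  private final BatchCallback callback;

  public QueryOption(int batchSize, BatchCallback callback) {
    this.batchSize = batchSize;
    this.callback = callback;
  }

  public int getBatchSize() {
    return batchSize;
  }

  public BatchCallback getCallback() {
    return callback;
  }
}
{code}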
If `root.allocateNew()` is called, the CSV file is as expected:
 ```
 column_1,column_2,column_3
 1,abc,3.14
 2,null,null
 ```
 Otherwise, the NULL values of the 2nd row keep the same values as the 1st row:
 ```
 column_1,column_2,column_3
 1,abc,3.14
 2,abc,3.14
 ```

**Question: Is it expected that `allocateNew` must be called every time the schema root is reused?**
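
To help narrow this down, here is a small diagnostic sketch (not part of the code above) that prints the validity bit of every cell next to its value via `FieldVector#isNull`; if `isNull` reports `false` for the cells that should be NULL in the 2nd batch, the validity buffer itself is not being reset when the root is reused:
{code:java}
import java.util.List;
import org.apache.arrow.vector.FieldVector;
import org.apache.arrow.vector.VectorSchemaRoot;

// Diagnostic helper (sketch): call dumpValidity(root) from the per-batch callback.
// FieldVector#isNull reads the validity buffer, while getObject() returns the value,
// so printing both shows which buffer still holds data from the previous batch.
public final class BatchDebug {
  static void dumpValidity(VectorSchemaRoot root) {
    final List<FieldVector> vectors = root.getFieldVectors();
    for (int row = 0; row < root.getRowCount(); row++) {
      for (FieldVector v : vectors) {
        System.out.printf("%s[%d] isNull=%b value=%s%n",
            v.getField().getName(), row, v.isNull(row), v.getObject(row));
      }
    }
  }
}
{code}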

Without reusing the schema root, the following code works as expected.
{code:java}
public void querySql(String query, QueryOption option) throws Exception {
  try (final java.sql.Connection conn = connectContainer.getConnection();
       final Statement stmt = conn.createStatement();
       final ResultSet rs = stmt.executeQuery(query)) {
    // create config without reused schema root, with custom batch size from option
    final JdbcToArrowConfig config = new JdbcToArrowConfigBuilder()
        .setAllocator(new RootAllocator())
        .setCalendar(JdbcToArrowUtils.getUtcCalendar())
        .setTargetBatchSize(option.getBatchSize())
        .setReuseVectorSchemaRoot(false)
        .build();

    final ArrowVectorIterator iterator = JdbcToArrow.sqlToArrowVectorIterator(rs, config);
    while (iterator.hasNext()) {
      // retrieve result from iterator
      try (VectorSchemaRoot root = iterator.next()) {
        option.getCallback().handleBatchResult(root);
        root.allocateNew();
      }
    }
  } catch (java.lang.Exception e) {
    throw new Exception(e.getMessage());
  }
}
{code}
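
Independently of the CSV printing, each batch can also be inspected with `VectorSchemaRoot#contentToTSVString`, which makes the stale values easy to spot; a minimal sketch of a debug callback (reusing the `QueryOption` shape sketched above):
{code:java}
// Sketch: a callback that dumps each batch as TSV instead of writing CSV.
// With setReuseVectorSchemaRoot(true) and no allocateNew() between batches,
// the stale values from the previous batch show up in this dump as well.
QueryOption dumpOption = new QueryOption(1, root -> System.out.print(root.contentToTSVString()));
connection.querySql("SELECT * FROM test_db", dumpOption);
{code}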

> VectorSchemaRoot is not refreshed when value is null
> ----------------------------------------------------
>
>                 Key: ARROW-14549
>                 URL: https://issues.apache.org/jira/browse/ARROW-14549
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Java
>    Affects Versions: 6.0.0
>            Reporter: Wenbo Hu
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)