Posted to jira@arrow.apache.org by "Wenbo Hu (Jira)" <ji...@apache.org> on 2021/11/02 07:46:00 UTC

[jira] [Created] (ARROW-14549) VectorSchemaRoot is not refreshed when value is null

Wenbo Hu created ARROW-14549:
--------------------------------

             Summary: VectorSchemaRoot is not refreshed when value is null
                 Key: ARROW-14549
                 URL: https://issues.apache.org/jira/browse/ARROW-14549
             Project: Apache Arrow
          Issue Type: Bug
          Components: Java
    Affects Versions: 6.0.0
            Reporter: Wenbo Hu


I'm using `arrow-jdbc` to convert query results from JDBC to Arrow.
 With the following code, unexpected behavior occurs.

Assume a SQLite database in which col_2 and col_3 are null in the 2nd row.
|col_1|col_2|col_3|
|-------|--------|--------|
|1|abc|3.14|
|2|NULL|NULL|

As the documentation suggests,
bq. populated data over and over into the same VectorSchemaRoot in a stream of batches rather than creating a new VectorSchemaRoot instance each time. 
so *JdbcToArrowConfig* is set to reuse the root.


{code:java}
public void querySql(String query, QueryOption option) throws Exception {
  try (final java.sql.Connection conn = connectContainer.getConnection();
       final Statement stmt = conn.createStatement();
       final ResultSet rs = stmt.executeQuery(query)) {
    // create config with reuse schema root and custom batch size from option
    final JdbcToArrowConfig config = new JdbcToArrowConfigBuilder()
        .setAllocator(new RootAllocator())
        .setCalendar(JdbcToArrowUtils.getUtcCalendar())
        .setTargetBatchSize(option.getBatchSize())
        .setReuseVectorSchemaRoot(true)
        .build();

    final ArrowVectorIterator iterator = JdbcToArrow.sqlToArrowVectorIterator(rs, config);
    while (iterator.hasNext()) {
      // retrieve result from iterator
      final VectorSchemaRoot root = iterator.next();
      option.getCallback().handleBatchResult(root);
      root.allocateNew(); // allocateNew has to be called here
    }
  } catch (java.lang.Exception e) {
    throw new Exception(e.getMessage());
  }
}

......
// batch_size is set to 1, so the callback is called twice.
QueryOption options = new QueryOption(1,
    root -> {
      // if printer is not set, get schema, write header
      if (printer == null) {
        final String[] headers = root.getSchema().getFields().stream()
            .map(Field::getName)
            .toArray(String[]::new);
        printer = new CSVPrinter(writer,
            CSVFormat.Builder.create(CSVFormat.DEFAULT).setHeader(headers).build());
      }

      final int rows = root.getRowCount();
      final List<FieldVector> fieldVectors = root.getFieldVectors();

      // iterate over rows
      for (int i = 0; i < rows; i++) {
        final int rowId = i;
        final List<String> row = fieldVectors.stream()
            .map(v -> v.getObject(rowId))
            .map(String::valueOf)
            .collect(Collectors.toList());
        printer.printRecord(row);
      }
    });

connection.querySql("SELECT * FROM test_db", options);
......
{code}

 
 If `root.allocateNew()` is called, the CSV file is as expected:
 ```
 column_1,column_2,column_3
 1,abc,3.14
 2,null,null
 ```
 Otherwise, the null values in the 2nd row keep the values from the 1st row:
 ```
 column_1,column_2,column_3
 1,abc,3.14
 2,abc,3.14
 ```
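
 One way to picture the difference between the two outputs is the toy model below. It is only a sketch under stated assumptions, not Arrow's or arrow-jdbc's actual implementation: `ToyVector`, `ReuseDemo`, and `readBatches` are hypothetical names, and the model assumes that a writer skips null cells without clearing the validity flag left over from the previous batch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy stand-in for a nullable column vector: one validity flag and one value slot per row. */
final class ToyVector {
    private final boolean[] validity;
    private final String[] values;

    ToyVector(int capacity) {
        validity = new boolean[capacity];
        values = new String[capacity];
    }

    /** Stores a value and marks the slot as non-null. */
    void set(int index, String value) {
        values[index] = value;
        validity[index] = true;
    }

    /** Marks every slot as null again; plays the role of resetting the vector between batches. */
    void reset() {
        Arrays.fill(validity, false);
    }

    /** Returns the stored value, or null when the slot is marked as null. */
    String getObject(int index) {
        return validity[index] ? values[index] : null;
    }
}

final class ReuseDemo {
    /**
     * Feeds each cell as its own single-row batch into one reused vector.
     * ASSUMPTION: null cells are skipped rather than explicitly written as null.
     */
    static List<String> readBatches(String[] column, boolean resetBetweenBatches) {
        ToyVector vector = new ToyVector(1);
        List<String> seen = new ArrayList<>();
        for (String cell : column) {
            if (resetBetweenBatches) {
                vector.reset();
            }
            if (cell != null) {
                vector.set(0, cell);
            }
            seen.add(String.valueOf(vector.getObject(0)));
        }
        return seen;
    }

    public static void main(String[] args) {
        String[] col2 = {"abc", null};
        System.out.println(readBatches(col2, false)); // [abc, abc]
        System.out.println(readBatches(col2, true));  // [abc, null]
    }
}
```

 Under that assumption, the run without a reset reproduces the second output above, and the run with a reset reproduces the first. Whether arrow-jdbc actually behaves this way internally is exactly what the question below asks.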
 
 **Question: Is it expected that `allocateNew` must be called every time the schema root is reused?**
 
 
 Without reusing the schema root, the following code works as expected.

{code:java}

public void querySql(String query, QueryOption option) throws Exception {
  try (final java.sql.Connection conn = connectContainer.getConnection();
       final Statement stmt = conn.createStatement();
       final ResultSet rs = stmt.executeQuery(query)) {
    // create config without reuse schema root and custom batch size from option
    final JdbcToArrowConfig config = new JdbcToArrowConfigBuilder()
        .setAllocator(new RootAllocator())
        .setCalendar(JdbcToArrowUtils.getUtcCalendar())
        .setTargetBatchSize(option.getBatchSize())
        .setReuseVectorSchemaRoot(false)
        .build();

    final ArrowVectorIterator iterator = JdbcToArrow.sqlToArrowVectorIterator(rs, config);
    while (iterator.hasNext()) {
      // retrieve result from iterator
      try (VectorSchemaRoot root = iterator.next()) {
        option.getCallback().handleBatchResult(root);
        root.allocateNew();
      }
    }
  } catch (java.lang.Exception e) {
    throw new Exception(e.getMessage());
  }
}

{code}


