Posted to jira@arrow.apache.org by "Wenbo Hu (Jira)" <ji...@apache.org> on 2021/11/02 07:49:00 UTC
[jira] [Updated] (ARROW-14549) VectorSchemaRoot is not refreshed when value is null
[ https://issues.apache.org/jira/browse/ARROW-14549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wenbo Hu updated ARROW-14549:
-----------------------------
Description:
I'm using `arrow-jdbc` to convert query results from JDBC to Arrow.
However, with the following code, unexpected behavior happens.
Assume a SQLite database where, in the 2nd row, col_2 and col_3 are NULL.
|col_1|col_2|col_3|
|-------|--------|--------|
|1|abc|3.14|
|2|NULL|NULL|
As the documentation suggests,
{quote}populated data over and over into the same VectorSchemaRoot in a stream of batches rather than creating a new VectorSchemaRoot instance each time.
{quote}
*JdbcToArrowConfig* is set to reuse root.
{code:java}
public void querySql(String query, QueryOption option) throws Exception {
    try (final java.sql.Connection conn = connectContainer.getConnection();
         final Statement stmt = conn.createStatement();
         final ResultSet rs = stmt.executeQuery(query)) {
        // create config with a reused schema root and custom batch size from option
        final JdbcToArrowConfig config = new JdbcToArrowConfigBuilder()
                .setAllocator(new RootAllocator())
                .setCalendar(JdbcToArrowUtils.getUtcCalendar())
                .setTargetBatchSize(option.getBatchSize())
                .setReuseVectorSchemaRoot(true)
                .build();
        final ArrowVectorIterator iterator = JdbcToArrow.sqlToArrowVectorIterator(rs, config);
        // retrieve results from the iterator
        while (iterator.hasNext()) {
            final VectorSchemaRoot root = iterator.next();
            option.getCallback().handleBatchResult(root);
            root.allocateNew(); // allocateNew() has to be called here
        }
    } catch (java.lang.Exception e) {
        throw new Exception(e.getMessage());
    }
}
......
// batch_size is set to 1, so the callback is called twice
QueryOption options = new QueryOption(1,
    root -> {
        // if the printer is not set yet, get the schema and write the header
        if (printer == null) {
            final String[] headers = root.getSchema().getFields().stream()
                    .map(Field::getName).toArray(String[]::new);
            printer = new CSVPrinter(writer,
                    CSVFormat.Builder.create(CSVFormat.DEFAULT).setHeader(headers).build());
        }
        final int rows = root.getRowCount();
        final List<FieldVector> fieldVectors = root.getFieldVectors();
        // iterate over rows
        for (int i = 0; i < rows; i++) {
            final int rowId = i;
            final List<String> row = fieldVectors.stream()
                    .map(v -> v.getObject(rowId))
                    .map(String::valueOf)
                    .collect(Collectors.toList());
            printer.printRecord(row);
        }
    });
connection.querySql("SELECT * FROM test_db", options);
......
{code}
If `root.allocateNew()` is called, the CSV file is as expected:
{code}
column_1,column_2,column_3
1,abc,3.14
2,null,null
{code}
Otherwise, the null values in the 2nd row keep the values from the 1st row:
{code}
column_1,column_2,column_3
1,abc,3.14
2,abc,3.14
{code}
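This looks like the classic hazard of reusing a buffer without clearing it: if the writer skips null slots instead of overwriting them, the previous batch's values survive. A minimal, Arrow-free Java sketch of that mechanism (the names `ReusedBufferDemo` and `writeBatch` are made up for illustration, not Arrow API):

```java
import java.util.Arrays;

// Illustration only (not Arrow code): a writer that reuses a values buffer
// across batches but only overwrites non-null slots. Stale data from the
// previous batch survives wherever the new value is null.
public class ReusedBufferDemo {
    public static String[] writeBatch(String[] buffer, String[] values) {
        for (int i = 0; i < values.length; i++) {
            if (values[i] != null) { // null slots are skipped, not cleared
                buffer[i] = values[i];
            }
        }
        return buffer;
    }

    public static void main(String[] args) {
        String[] buffer = new String[3];
        // batch 1: all values present
        writeBatch(buffer, new String[]{"1", "abc", "3.14"});
        System.out.println(Arrays.toString(buffer)); // [1, abc, 3.14]
        // batch 2: trailing columns are null, so the old values leak through
        writeBatch(buffer, new String[]{"2", null, null});
        System.out.println(Arrays.toString(buffer)); // [2, abc, 3.14]
    }
}
```

Calling `allocateNew()` would correspond to replacing the buffer with a fresh zeroed one before each batch, which is why it hides the symptom.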
**Question: Is it expected to call `allocateNew` every time the schema root is reused?**
Without reusing the schema root, the following code works as expected.
{code:java}
public void querySql(String query, QueryOption option) throws Exception {
    try (final java.sql.Connection conn = connectContainer.getConnection();
         final Statement stmt = conn.createStatement();
         final ResultSet rs = stmt.executeQuery(query)) {
        // create config without reusing the schema root, with custom batch size from option
        final JdbcToArrowConfig config = new JdbcToArrowConfigBuilder()
                .setAllocator(new RootAllocator())
                .setCalendar(JdbcToArrowUtils.getUtcCalendar())
                .setTargetBatchSize(option.getBatchSize())
                .setReuseVectorSchemaRoot(false)
                .build();
        final ArrowVectorIterator iterator = JdbcToArrow.sqlToArrowVectorIterator(rs, config);
        // retrieve results from the iterator
        while (iterator.hasNext()) {
            try (VectorSchemaRoot root = iterator.next()) {
                option.getCallback().handleBatchResult(root);
                root.allocateNew();
            }
        }
    } catch (java.lang.Exception e) {
        throw new Exception(e.getMessage());
    }
}
{code}
> VectorSchemaRoot is not refreshed when value is null
> ----------------------------------------------------
>
> Key: ARROW-14549
> URL: https://issues.apache.org/jira/browse/ARROW-14549
> Project: Apache Arrow
> Issue Type: Bug
> Components: Java
> Affects Versions: 6.0.0
> Reporter: Wenbo Hu
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)