You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@beam.apache.org by "jgill02 (via GitHub)" <gi...@apache.org> on 2023/04/18 06:43:51 UTC

[GitHub] [beam] jgill02 opened a new issue, #26316: [Bug]: Go SDK - avroio data emitting edge case when there are nil/null values

jgill02 opened a new issue, #26316:
URL: https://github.com/apache/beam/issues/26316

   ### What happened?
   
   Beam go sdk 2.46 - avroio package (with dataflow runner). 
   
   I believe that [this](https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/io/avroio/avroio.go#L104) line contributes to a bit of an edge case issue in situations where there are nil values for columns in a .avro file (using OCF). For reference, I encountered this issue using a bigquery EXPORT DATA OPTIONS (to avro) command, and then reading the .avro file in GCS from a dataflow job.
   
   Specifically the below snippet:
   
   ```
   	val := reflect.New(f.Type.T).Interface()
   	for ar.Scan() {
   		var i any
   		i, err = ar.Read()
   		if err != nil {
   			log.Errorf(ctx, "error reading avro row: %v", err)
   			continue
   		}
   
   		// marshal interface to bytes
   		var b []byte
   		b, err = json.Marshal(i)
   		if err != nil {
   			log.Errorf(ctx, "error unmarshalling avro data: %v", err)
   			return
   		}
   
   		switch reflect.New(f.Type.T).Interface().(type) {
   		case *string:
   			emit(string(b))
   		default:
   			if err = json.Unmarshal(b, val); err != nil {
   				log.Errorf(ctx, "error unmashalling avro to type: %v", err)
   				return
   			}
   			emit(reflect.ValueOf(val).Elem().Interface())
   		}
   	}
   ```
   
   By declaring '**val**' outside of the scan() for loop, if a record (e.g. 'row 1') in a .avro file has a column (e.g. "foo") with a value (e.g. "bar"), and then in a subsequent record (e.g. 'row 2') the subsequent value for the "foo" column is nil (i.e. no data), then the bottom 'emit(reflect.ValueOf(**val**).Elem().Interface())' statement will emit "bar" for "foo" for "row 2", when instead {} (empty value) should be emitted for that column. 
   
   I believe this is because "val" is a pointer, and therefore if upon one invocation of the ar.Scan() / ar.Read() loop some of the elements are filled with data, if a subsequent datum in the avro file is nil for a given column, then that column won't be overridden with a nil value / the previous datum's value will be persisted in the final emit statement. 
   
   By moving the "val :=" declaration just above "var i any" / just below the "for ar.Scan() {" loop, this issue can be mitigated, though I acknowledge that may incur a bit of a performance penalty in doing so.
   
   For reference, the bigquery "export data options" statement (to avro) will have its exported avro files have an ocf header with nullable fields like so: "fields":[{"name": "foo", type":["null","string"],"default":null}]
   
   ### Issue Priority
   
   Priority: 2 (default / most bugs should be filed as P2)
   
   ### Issue Components
   
   - [ ] Component: Python SDK
   - [ ] Component: Java SDK
   - [X] Component: Go SDK
   - [ ] Component: Typescript SDK
   - [ ] Component: IO connector
   - [ ] Component: Beam examples
   - [ ] Component: Beam playground
   - [ ] Component: Beam katas
   - [ ] Component: Website
   - [ ] Component: Spark Runner
   - [ ] Component: Flink Runner
   - [ ] Component: Samza Runner
   - [ ] Component: Twister2 Runner
   - [ ] Component: Hazelcast Jet Runner
   - [X] Component: Google Cloud Dataflow Runner


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] jgill02 commented on issue #26316: [Bug]: Go SDK - avroio data emitting edge case when there are nil/null values

Posted by "jgill02 (via GitHub)" <gi...@apache.org>.
jgill02 commented on issue #26316:
URL: https://github.com/apache/beam/issues/26316#issuecomment-1528129596

   Hi Robert, thank you so much. I would be happy to and will work on getting that going. Thank you! - John


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] riteshghorse closed issue #26316: [Bug]: Go SDK - avroio data emitting edge case when there are nil/null values

Posted by "riteshghorse (via GitHub)" <gi...@apache.org>.
riteshghorse closed issue #26316: [Bug]: Go SDK - avroio data emitting edge case when there are nil/null values
URL: https://github.com/apache/beam/issues/26316


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] lostluck commented on issue #26316: [Bug]: Go SDK - avroio data emitting edge case when there are nil/null values

Posted by "lostluck (via GitHub)" <gi...@apache.org>.
lostluck commented on issue #26316:
URL: https://github.com/apache/beam/issues/26316#issuecomment-1527954563

   I agree. Care to contribute that PR?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org