Posted to issues@openwhisk.apache.org by GitBox <gi...@apache.org> on 2018/06/07 17:50:28 UTC

[GitHub] maximann opened a new issue #269: Binary Encoding causes Corruption

URL: https://github.com/apache/incubator-openwhisk-package-kafka/issues/269
 
 
   It looks to me like https://github.com/apache/incubator-openwhisk-package-kafka/blob/449bbae13e813ba4dcd11dc33f47ab29d5e3541a/provider/consumer.py#L455 encodes all values as UTF-8 before base64. My understanding is that some byte values are not valid UTF-8, thus corruption can occur at the line above. ISO-8859-1 seems like the better choice. I'd be happy to file a bug if this seems like a reasonable hypothesis and if someone could point me to where to do that. I was also wondering if someone has encountered a similar issue before, especially since binary serializers such as Avro are popular for Kafka.
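
   A minimal sketch of the hypothesis above (illustrative only, not the provider's actual code): decoding arbitrary bytes as UTF-8 with replacement is lossy, while ISO-8859-1 (Latin-1) maps every byte value to a code point and round-trips losslessly.

   ```python
   # Arbitrary binary payload (e.g. an Avro-encoded Kafka message).
   raw = bytes([0x00, 0xFF, 0xFE, 0x80])

   # UTF-8 with errors='replace' substitutes U+FFFD for invalid bytes,
   # so the original value can no longer be recovered.
   utf8_text = raw.decode('utf-8', errors='replace')
   assert utf8_text.encode('utf-8') != raw  # data corrupted

   # Latin-1 assigns every byte 0x00-0xFF a code point, so it is lossless.
   latin1_text = raw.decode('iso-8859-1')
   assert latin1_text.encode('iso-8859-1') == raw  # round-trips intact
   ```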
   
   Slack Transcript:
   Did you try with the parameter “isBinaryValue” set to true?
   
   
   Max Manndorff [18 hours ago]
   Yes
   
   
   csantanapr [18 hours ago]
   I believe the intention is to take the binary data and base64-encode it so it can be sent to the controller HTTP REST API
   
   csantanapr [18 hours ago]
   The controller will take the base64 in the JSON and pass it down through Kafka; the invoker sends the base64 data in an HTTP POST request to the function runtime, and the function receives the application/json with this field as base64 data
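
   The hop-by-hop flow described above can be sketched as follows (a minimal illustration, not the actual OpenWhisk code): base64 over the raw bytes survives JSON/HTTP untouched at every hop, with no text decoding needed along the way.

   ```python
   import base64
   import json

   payload = bytes([0x00, 0xFF, 0x80, 0x7F])  # arbitrary binary message

   # Provider side: base64-encode the raw bytes and wrap them in JSON.
   wire = json.dumps({"value": base64.b64encode(payload).decode('ascii')})

   # ... sent over HTTP to the controller, through Kafka, to the invoker,
   # and finally POSTed to the function runtime ...

   # Function side: decode the base64 field to recover the original bytes.
   received = base64.b64decode(json.loads(wire)["value"])
   assert received == payload  # lossless end to end
   ```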
   
   Max Manndorff [18 hours ago]
   Right, that makes sense. But my understanding is all of this happens after binary data was consumed in the provider code linked above, so the corrupted data gets passed along just fine and makes it to the invocation correctly. But when binary messages are consumed from the external kafka cluster (the cluster which the trigger is setup for), and encoded as a UTF-8 strings before being passed along to the controller, the corruption has already happened. Or at least that's what I think is happening.
   
   csantanapr [17 hours ago]
   I think we saw a problem in practice where we found messages had corrupted data, and the line 456 is what attempts to clean that up
   
   
   Max Manndorff [17 hours ago]
   Definitely seeing that in my logs:
   ```[2018-06-06T23:24:13.040Z] [WARNING] [??] [kafkatriggers] [/core/qmpTrigger] Value contains non-unicode bytes. Replacing invalid bytes.```
   
   
   Max Manndorff [17 hours ago]
   If you're ok with it I'll try to fix this for binary formats such as Avro tomorrow and send a diff your way.
   
   csantanapr [17 hours ago]
   You mean don’t do encode utf-8 when isBinaryValue=true ?
   
   csantanapr [17 hours ago]
   My concern is how current users are impacted in production by the change
   
   csantanapr [17 hours ago]
   When they process the input value in the function and decode the base64, they get UTF-8 data and work on that data. (edited)
   
   csantanapr [17 hours ago]
   With this change their value data will not be UTF-8, and their functions would start to fail unexpectedly?
   
   csantanapr [17 hours ago]
   I’m all in for improving the code, but not throwing the current customers under the bus, oops I meant users :smiley_cat:
   
   csantanapr [17 hours ago]
   Open an issue on the kafka repo and submit a PR so we can discuss the details more in depth with others
   
   Max Manndorff [17 hours ago]
   Sounds good, will do
   
   Max Manndorff [17 hours ago]
   Thanks
   
   Max Manndorff [17 hours ago]
   (I agree, valid concerns) (edited)
   
   
   csantanapr [17 hours ago]
   Maybe we need an extra parameter, or maybe it’s compatible, or we try one and then the other. Let’s think about it
   
   
   csantanapr [17 hours ago]
   Or deprecate and use a new method, or handle triggers already created by leaving them alone, etc...
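
   The backward-compatibility idea discussed here could look something like this (a hypothetical sketch; the parameter name `binary_encoding` and the function are invented for illustration, not part of the provider): existing triggers keep today's UTF-8 behavior, while new triggers can opt into a lossless pass-through.

   ```python
   import base64

   def encode_value(raw: bytes, binary_encoding: str = 'utf-8') -> str:
       if binary_encoding == 'raw':
           # Opt-in new behavior: base64 over the untouched bytes (lossless).
           return base64.b64encode(raw).decode('ascii')
       # Current behavior: decode as UTF-8 first, replacing invalid bytes,
       # then base64 the re-encoded text (lossy for arbitrary binary).
       text = raw.decode('utf-8', errors='replace')
       return base64.b64encode(text.encode('utf-8')).decode('ascii')
   ```

   Existing triggers would default to the old path and see no change; only triggers created with the new option get the lossless encoding.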
   
   csantanapr [17 hours ago]
   Those are the types of things I also want to include as part of the fix, along with companion tests and how to roll it out to production
   
   csantanapr [17 hours ago]
   This is Awesome that people are coming forward to contribute to event feeds +1 cc @dubee
   
   Max Manndorff [17 hours ago]
   FWIW, I doubt that true binary works at all. The issue is explained here as well: https://haacked.com/archive/2012/01/30/hazards-of-converting-binary-data-to-a-string.aspx/
   (edited)
   
   
   csantanapr [17 hours ago]
   You might be correct
   
   
   csantanapr [17 hours ago]
   We just need something that can be safely passed through HTTP application/json
   
   csantanapr [17 hours ago]
   That’s why I explained all the touch points where the data gets touched and transformed
   
   csantanapr [17 hours ago]
   In the issue we can ping @berstler; he was the original author who dealt with the binary use case.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services