Posted to user@drill.apache.org by "benj.dev" <be...@laposte.net.INVALID> on 2019/03/15 19:31:50 UTC

regexp_replace function and MultiEncoding problem

Hi,

I have a source (.csv) with mixed encodings (it's [bs]ad but I can't change
that).
When I try to apply regexp_replace on a field
(like ... regexp_replace(`myfield`,'...','...') ...) I get an error:
- Error: SYSTEM ERROR: MalformedInputException: Input length = 1

For example, one case is caused by an "ö" encoded in ISO-8859-1 (\xF6) in
the .csvh file.
When Drill tries to apply regexp_replace(), since it works in UTF-8 it
probably reasons: "a byte between F0 and F7, so this starts a 4-byte UTF-8
sequence". Unfortunately the next bytes are normal characters, so the
second byte is not of the form 10xxxxxx and the sequence is not valid UTF-8.
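
To illustrate the suspected cause outside Drill, here is a minimal Python
sketch (not Drill code; the helper name is_valid_utf8 and the sample word
are my own, chosen only for illustration). It shows that the ISO-8859-1
byte \xF6 followed by plain ASCII is rejected by a strict UTF-8 decoder,
while the same text encoded as UTF-8 is accepted:

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes as strict UTF-8, False otherwise."""
    try:
        raw.decode("utf-8")  # strict error handling is the default
        return True
    except UnicodeDecodeError:
        return False


# The same word in the two encodings (sample data, not from my real file).
latin1_bytes = "Köln".encode("iso-8859-1")  # b'K\xf6ln'
utf8_bytes = "Köln".encode("utf-8")         # b'K\xc3\xb6ln'

print(is_valid_utf8(latin1_bytes))  # False
print(is_valid_utf8(utf8_bytes))    # True
```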

I can't explicitly convert the file from ISO-8859-1 to UTF-8 because
some lines could be in ISO-8859-1, others in ISO-8859-5 or any other
encoding (single-byte, multi-byte or variable-length).
I don't want to eliminate the "problematic" characters because I hope
that a human can sometimes decide, or at least be helped, using this info.

So is there any way to use the regexp_replace function without getting an
error, typically by using regexp_replace in US-ASCII mode (like LC_ALL=C sed ...)?
Or an option to continue even when an error occurs?
Or a Drill function that detects invalid UTF-8 sequences and can prevent
regexp_replace from being applied to such strings?
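
To make the first and third options concrete, this is roughly the behaviour
I am after, written as a Python sketch. Nothing here is a Drill API: the
function names replace_bytewise and replace_if_valid, the sample row and the
pattern are all hypothetical. The first helper works on raw bytes, like
LC_ALL=C sed; the second leaves a value untouched when it is not valid UTF-8:

```python
import re


def replace_bytewise(raw: bytes, pattern: bytes, repl: bytes) -> bytes:
    """Apply the regex to raw bytes; nothing is decoded, so nothing can fail to decode."""
    return re.sub(pattern, repl, raw)


def replace_if_valid(raw: bytes, pattern: str, repl: str) -> bytes:
    """Apply the regex only if raw is valid UTF-8; otherwise return raw unchanged."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw  # keep the original bytes so a human can inspect them later
    return re.sub(pattern, repl, text).encode("utf-8")


row = b"K\xf6ln;  Germany"  # ISO-8859-1 "ö" inside an otherwise ASCII value

print(replace_bytewise(row, rb";\s+", b";"))  # b'K\xf6ln;Germany'
print(replace_if_valid(row, r";\s+", ";"))    # b'K\xf6ln;  Germany' (unchanged)
```

In both cases the \xF6 byte survives, so the information is still there for
later inspection.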

Thanks for any ideas,