Posted to user@spark.apache.org by vincent gromakowski <vi...@gmail.com> on 2017/03/21 17:15:32 UTC

data cleaning and error routing

Hi,
In a context of dirty data, I am trying to find an efficient way to parse a
kafka stream of CSV lines into a clean data model and to route the lines in
error to a dedicated topic.

Generally I do this:
1. First, a map to split each line on the separator character (";")
2. Then a filter with all my conditions (number of fields...)
3. Then subtract the filtered result from the original to get the lines in
error, and save them to the error topic
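The three steps can be sketched Spark-free on a plain Scala collection standing in for the DStream/RDD (the 3-field layout and the sample lines are made up for illustration; on Spark, step 3 would be `rdd.subtract(valid)`):

```scala
// Sample lines; assume a valid row has exactly 3 fields
val lines = Seq("1;alice;10", "2;bob", "3;carol;7;extra")

// 1. split each line on the separator (limit -1 keeps trailing empty fields)
val split = lines.map(_.split(";", -1))

// 2. keep rows that pass the structural checks
val valid = split.filter(_.length == 3)

// 3. the complement (Spark's subtract) gives the rows in error
val errors = split.filterNot(_.length == 3)

println(valid.map(_.mkString(";")))   // → List(1;alice;10)
println(errors.map(_.mkString(";")))  // rows for the error topic
```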

The problem with this approach is that I cannot efficiently test the parsing
of String fields into other types like Int or Date. I would like to:
- test for incomplete lines (array length < x)
- test for empty fields
- test field casting to Int, Long...
- some errors should evict the line, some shouldn't (use Try getOrElse ?)
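One way to cover all four checks in a single pass is to parse each line into an Either, so the clean records and the lines in error (with a reason attached) come out of one map instead of a filter + subtract. A minimal sketch, assuming a hypothetical `id;name;amount` layout; evicting errors return Left, while recoverable cast errors fall back to a default via `Try(...).getOrElse`:

```scala
import scala.util.Try

// Hypothetical target model for an "id;name;amount" line
final case class Record(id: Int, name: String, amount: Long)

// Left((line, reason)) for evicting errors, Right(record) otherwise
def parse(line: String): Either[(String, String), Record] = {
  val f = line.split(";", -1)
  if (f.length < 3)             Left(line -> s"expected 3 fields, got ${f.length}")
  else if (f.exists(_.isEmpty)) Left(line -> "empty field")
  else Try(f(0).toInt).fold(
    _  => Left(line -> "id is not an Int"),                        // evicting cast error
    id => Right(Record(id, f(1), Try(f(2).toLong).getOrElse(0L)))  // non-evicting: default to 0
  )
}

// partitionMap (Scala 2.13) splits the parsed stream into its two sides in one pass
val (errors, records) =
  Seq("1;alice;10", "2;bob", "x;carol;7").map(parse).partitionMap(identity)
```

On Spark the same parse function can be mapped over the stream, with the Left side written to the error topic and the Right side to the clean one, so the subtract step disappears.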

How do you generally achieve this? I cannot find any good data cleaning
examples...