Posted to user@avro.apache.org by Bob Wakefield <ad...@hotmail.com> on 2016/05/09 16:14:37 UTC

is this an appropriate Avro use case?

I was watching a video presentation by Jay Kreps where he was talking about some data challenges he was dealing with that he solved with Avro. The thing is, he glosses over the details.

I am in a situation where I have to ingest CSVs. The files are picked up by SSIS and imported into a data warehouse. My problem is that the files are created by a system that apparently isn’t stable. The developers of the system like to add columns without warning. What is particularly annoying is that they can’t seem to decide how to represent negative numbers. Sometimes the numbers have a negative sign, which is fine. Sometimes they come in accounting notation, with parentheses to denote negative numbers. That is not fine, as SQL Server doesn’t understand that as a negative.

Can I somehow use Avro to ENSURE that the file from the third-party system arrives in the format I expect?


Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics, LLC
913.938.6685
www.massstreet.net
www.linkedin.com/in/bobwakefieldmba
Twitter: @BobLovesData

Re: is this an appropriate Avro use case?

Posted by Akshay Aggarwal <ak...@flipkart.com>.
Hey,

Since you don't have control over the third-party system, you can't ensure
that the data matches your expectations when it arrives. But you can build
a data processing layer that cleans the raw data once it has arrived in the
warehouse and stores it in Avro with your own schema definition. When you
want to add a new field that has started flowing in, delete a field, or
rename one, you can update the schema in a backward-compatible manner (and
update the processing layer) and be assured that none of the downstream
systems will break.

Thanks,
Akshay Aggarwal
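
To illustrate the kind of backward-compatible schema update described above, here is a minimal sketch in plain Python. It does not use an Avro library; it only imitates, for a flat record, three of the resolution rules in the Avro specification: a field missing from the data takes the reader schema's default, a renamed field is found through an alias, and a field the reader does not know about is ignored. The record name, the field names, and the resolve helper are all hypothetical.

```python
import json

# Hypothetical reader schema; none of these names come from the thread.
READER_SCHEMA = json.loads("""
{
  "type": "record",
  "name": "Transaction",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "amount", "type": "double"},
    {"name": "currency", "type": "string", "default": "USD"},
    {"name": "memo", "type": ["null", "string"], "default": null,
     "aliases": ["note"]}
  ]
}
""")

_MISSING = object()  # sentinel, so that a legitimate None value is not lost


def resolve(record, reader_schema=READER_SCHEMA):
    """Simplified imitation of Avro schema resolution for a flat record."""
    out = {}
    for field in reader_schema["fields"]:
        # Accept the current name first, then any older names listed as aliases.
        names = [field["name"], *field.get("aliases", [])]
        value = next((record[n] for n in names if n in record), _MISSING)
        if value is _MISSING:
            if "default" not in field:
                raise KeyError(f"no value or default for field {field['name']!r}")
            value = field["default"]
        out[field["name"]] = value
    # Fields present in the data but absent from the schema are dropped.
    return out


old_row = {"id": 1, "amount": -200.0, "note": "refund"}
new_row = {"id": 2, "amount": 5.0, "currency": "EUR", "memo": None, "extra": "x"}
print(resolve(old_row))
print(resolve(new_row))
```

A field added without a default (or removed while readers still require it) is what breaks compatibility; in this sketch that case surfaces as the KeyError.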

On Tue, May 10, 2016 at 6:28 AM Sean Busbey <bu...@cloudera.com> wrote:

> On Mon, May 9, 2016 at 12:21 PM, Koert Kuipers <ko...@tresata.com> wrote:
> > you cannot use avro to ensure the data comes in the format you expect (the
> > negative numbers issue). you will have to parse these variations before
> > converting to avro.
>
> Unless, of course, you can get the folks sending you data to agree to
> send it in Avro. If you specifically get them to send the numbers
> coded as one of the number types in Avro (rather than, e.g., a string),
> you'd be able to parse it the same way all of the time.
>
>
>
>
> --
> busbey
>

Re: is this an appropriate Avro use case?

Posted by Bob Wakefield <ad...@hotmail.com>.
Agreed.


From: Sam Groth 
Sent: Wednesday, May 11, 2016 11:11 AM
To: user@avro.apache.org 
Subject: Re: is this an appropriate Avro use case?

So there are 2 possible cases that I see: 1) You are able to get the data producer to switch to Avro using type int/double for the number fields. Then they would be forced to follow the types in the schema. 2) You write a data cleansing layer to fix inconsistencies and handle schema changes. In this case, I don't see any advantage to using Avro.


Sam 



On Wednesday, May 11, 2016 10:49 AM, Bob Wakefield <ad...@hotmail.com> wrote:




If I’ve been following properly it sounds like while the schema change would be handled, data cleansing would still have to be coded. I was thinking of converting from CSV to Avro but then I’d have to convert back to CSV to shove it into the database. I’m not opposed to doing that, I just don’t think it solves my problem with the negative numbers data type issue unless Avro understands (200) = –200.


Re: is this an appropriate Avro use case?

Posted by Sam Groth <sg...@yahoo-inc.com>.
So there are 2 possible cases that I see: 1) You are able to get the data producer to switch to Avro using type int/double for the number fields. Then they would be forced to follow the types in the schema. 2) You write a data cleansing layer to fix inconsistencies and handle schema changes. In this case, I don't see any advantage to using Avro.

Sam  


Re: is this an appropriate Avro use case?

Posted by Bob Wakefield <ad...@hotmail.com>.
If I’ve been following properly it sounds like while the schema change would be handled, data cleansing would still have to be coded. I was thinking of converting from CSV to Avro but then I’d have to convert back to CSV to shove it into the database. I’m not opposed to doing that, I just don’t think it solves my problem with the negative numbers data type issue unless Avro understands (200) = –200.


From: kppublicmail . 
Sent: Wednesday, May 11, 2016 10:35 AM
To: user@avro.apache.org 
Subject: Re: is this an appropriate Avro use case?

Another option is to convert the CSV file to Avro before it is consumed.

Thanks.


Re: is this an appropriate Avro use case?

Posted by "kppublicmail ." <kp...@gmail.com>.
Another option is to convert the CSV file to Avro before it is consumed.

Thanks.
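
Any such conversion step first has to read the CSV in a way that survives columns being added without warning. A minimal sketch of that reading step, using only Python's standard csv module, is shown here. It looks columns up by header name rather than by position, fails loudly when an expected column is missing, and reports unexpected columns instead of breaking on them. The column names and the read_expected helper are hypothetical, and the values are left as strings, so numeric cleanup would still be a separate step.

```python
import csv
import io

# Hypothetical list of the columns the warehouse load actually needs.
EXPECTED = ["id", "amount"]


def read_expected(csv_text, expected=EXPECTED):
    """Return (rows, extra_columns) for a CSV given as text with a header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    missing = [c for c in expected if c not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    # Columns the producer added that this load does not know about yet.
    extra = [c for c in header if c not in expected]
    # Keep only the expected columns, selected by name rather than position.
    rows = [{c: row[c] for c in expected} for row in reader]
    return rows, extra


sample = "id,amount,new_col\n1,(200),x\n2,-15,y\n"
rows, extra = read_expected(sample)
print(rows)
print(extra)
```

The rows produced this way could then be written to Avro, or loaded directly, once the values themselves have been cleaned.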

Re: is this an appropriate Avro use case?

Posted by Sean Busbey <bu...@cloudera.com>.
On Mon, May 9, 2016 at 12:21 PM, Koert Kuipers <ko...@tresata.com> wrote:
> you cannot use avro to ensure the data comes in the format you expect (the
> negative numbers issue). you will have to parse these variations before
> converting to avro.

Unless, of course, you can get the folks sending you data to agree to
send it in Avro. If you specifically get them to send the numbers
coded as one of the number types in Avro (rather than, e.g., a string),
you'd be able to parse it the same way all of the time.




-- 
busbey
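
As a rough illustration of why a numeric type in the schema helps: once a field is declared as long or double, a value such as the string "(200)" no longer fits it, so the producer has to resolve the notation before the record can be written. The sketch below hand-rolls that check in plain Python for a few primitive types. It is not an Avro library call, it skips range checks (for example, that an int fits in 32 bits), and the schema, the field names, and both helpers are hypothetical.

```python
# Hypothetical schema; the record and field names are illustrative only.
SCHEMA = {
    "type": "record",
    "name": "Transaction",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "amount", "type": "double"},
    ],
}

# Python types accepted for a few Avro numeric type names (simplified).
NUMERIC = {
    "int": (int,),
    "long": (int,),
    "float": (int, float),
    "double": (int, float),
}


def conforms(value, avro_type):
    """True if value can stand in for the named primitive type."""
    # bool is a subclass of int in Python, so handle it before the numerics.
    if isinstance(value, bool):
        return avro_type == "boolean"
    if avro_type == "boolean":
        return False
    if avro_type == "string":
        return isinstance(value, str)
    return isinstance(value, NUMERIC[avro_type])


def offending_fields(record, schema=SCHEMA):
    """Names of fields that are absent or hold a value of the wrong type."""
    return [
        f["name"]
        for f in schema["fields"]
        if f["name"] not in record or not conforms(record[f["name"]], f["type"])
    ]


print(offending_fields({"id": 1, "amount": "(200)"}))
print(offending_fields({"id": 1, "amount": -200.0}))
```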

Re: is this an appropriate Avro use case?

Posted by Koert Kuipers <ko...@tresata.com>.
you can use avro to handle columns being added without warning. you can
also use avro to handle column renames, etc.

you cannot use avro to ensure the data comes in the format you expect (the
negative numbers issue). you will have to parse these variations before
converting to avro.
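
The parsing step mentioned here could look something like the following minimal sketch. It normalizes both notations from the original question, a leading minus sign and accounting-style parentheses, into a single signed value. The function name parse_amount is hypothetical, and stripping commas as thousands separators is an assumption that was not stated in the thread.

```python
from decimal import Decimal, InvalidOperation


def parse_amount(text):
    """Parse '200', '-200', or '(200)' into a signed Decimal."""
    s = text.strip()
    negative = False
    # Accounting notation: a value wrapped in parentheses is negative.
    if s.startswith("(") and s.endswith(")"):
        negative = True
        s = s[1:-1].strip()
        if s.startswith("-"):
            # '(-200)' is ambiguous, so reject it instead of guessing.
            raise ValueError(f"ambiguous sign in {text!r}")
    # Assumption: commas are thousands separators, not decimal marks.
    s = s.replace(",", "")
    try:
        value = Decimal(s)
    except InvalidOperation:
        raise ValueError(f"unrecognized numeric format: {text!r}") from None
    if not value.is_finite():
        # Decimal also accepts 'NaN' and 'Infinity'; treat those as bad input.
        raise ValueError(f"not a finite number: {text!r}")
    return -value if negative else value


for raw in ["200", "-200", "(200)", " (1,234.50) "]:
    print(repr(raw), "->", parse_amount(raw))
```

Raising on anything unrecognized, rather than passing it through, means a new notation from the producer shows up as a load error instead of as silently wrong data.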
