Posted to dev@samoa.apache.org by Gianmarco De Francisci Morales <gd...@apache.org> on 2016/01/31 13:45:43 UTC

Re: Avro Support for SAMOA

Thanks Jay,
I updated the wiki to include the links (
https://cwiki.apache.org/confluence/display/SAMOA/Sample+Avro+Datasets).
Unfortunately, the files are too big to attach directly in the wiki.

Cheers,

-- Gianmarco

On 25 December 2015 at 16:48, Jayadeep J <ja...@gmail.com> wrote:

> Hi Gianmarco,
>
> I have created a PR with the documentation for website docs.
>
> The zipped test data-sets are 2 files of 20 MB each, one for JSON and one
> for Binary. If you can attach them to the wiki, that would be great. I don't
> think I have access to create a wiki page. The links to download the files
> are below:
>
>
> https://drive.google.com/file/d/0B844rHJZHzKMSFVwVVRPVjhCOTA/view?usp=sharing
>
>
>
> https://drive.google.com/file/d/0B844rHJZHzKMSlRRaVA0TU0zRjQ/view?usp=sharing
>
>
> Thanks
> Jay
>
>
>
> On Mon, Nov 30, 2015 at 9:09 PM, Gianmarco De Francisci Morales <
> gdfm@apache.org> wrote:
>
> > Thanks Jayadeep,
> >
> > I think the docs could go in the website docs.
> > Not sure about the test datasets. Maybe as attachments in the wiki if they
> > are not too big?
> >
> > --
> > Gianmarco
> >
> > On 30 November 2015 at 14:38, Jayadeep J <ja...@gmail.com> wrote:
> >
> > > Hi Gianmarco,
> > >
> > > > I have closed the PR.
> > > >
> > > > Please let me know where to put the instructions for using Avro, the
> > > > Input Format document, and the test data sets.
> > >
> > >
> > >
> >
> https://drive.google.com/file/d/0B844rHJZHzKMdk5oMHZWREdxMnM/view?usp=sharing
> > >
> > >
> > >
> > >
> >
> https://drive.google.com/file/d/0B844rHJZHzKMSFVwVVRPVjhCOTA/view?usp=sharing
> > >
> > >
> > >
> > >
> >
> https://drive.google.com/file/d/0B844rHJZHzKMSlRRaVA0TU0zRjQ/view?usp=sharing
> > >
> > >
> > > Thanks
> > > Jay
> > > https://github.com/jayadeepj
> > >
> > >
> > > > On Thu, Nov 5, 2015 at 3:05 PM, Jayadeep J <ja...@gmail.com> wrote:
> > >
> > > > Hi Gianmarco,
> > > >
> > > > > All the test instructions, test data & other details are updated on the
> > > > > pull request
> > > >
> > > > Thanks
> > > > Jay
> > > > https://github.com/jayadeepj
> > > >
> > > > On Thu, Nov 5, 2015 at 12:50 PM, Gianmarco De Francisci Morales <
> > > > gdfm@apache.org> wrote:
> > > >
> > > >> Thanks Jay,
> > > >>
> > > > >> I'll test it this weekend. Do you have some instructions and data I could
> > > > >> use to try it out?
> > > >>
> > > >> --
> > > >> Gianmarco
> > > >>
> > > > >> On 4 November 2015 at 16:47, Jayadeep J <ja...@gmail.com> wrote:
> > > >>
> > > >> > Hi Gianmarco,
> > > >> >
> > > > >> > I have implemented this functionality as per the suggestions and have
> > > > >> > raised a pull request.
> > > > >> >
> > > > >> > The implementation details are as follows.
> > > > >> >
> > > > >> > 1) A new AvroFileStream, a subclass of the existing FileStream, that
> > > > >> > takes the encoding format (json/binary) from the command line. It uses
> > > > >> > an InputStream instead of the current io Reader to handle binary streams.
> > > > >> > 2) A common Loader interface to make the parsing of streams generic
> > > > >> > rather than ARFF-only.
> > > > >> > 3) A new AvroLoader abstract class in samoa-instances that handles the
> > > > >> > parsing of Avro Generic Records from the InputStream into SAMOA
> > > > >> > instances. If even one attribute in the Avro schema has a null union
> > > > >> > (a nullable attribute), the record is converted into a SAMOA sparse
> > > > >> > instance, otherwise into a DenseInstance.
> > > > >> > 4) Two subclasses of AvroLoader for Binary & JSON parsing, i.e.
> > > > >> > AvroJsonLoader & AvroBinaryLoader. Both set the meta-data & Avro
> > > > >> > schema on initialization. They use separate decoders to read from the
> > > > >> > stream.
> > > > >> > 5) Appropriate changes in the poms, Instances.java & ARFFLoader to use
> > > > >> > the new Loader interface.
> > > > >> >
> > > > >> > However, I have seen that the Travis build has failed. I couldn't tell
> > > > >> > from the logs whether it is due to this code change.
> > > >> >
> > > >> > Thanks
> > > >> > Jay
> > > >> > https://github.com/jayadeepj
> > > >> >
> > > >> > On Mon, Oct 26, 2015 at 12:39 PM, Gianmarco De Francisci Morales <
> > > >> > gdfm@apache.org> wrote:
> > > >> >
> > > >> > > Hi Jay,
> > > >> > >
> > > >> > > 1) I agree custom data types would be overkill.
> > > > >> > > I was thinking of the second option you mentioned, distinguishing it
> > > > >> > > inside the code.
> > > > >> > > So the parser code would expect either all values to be optional, or
> > > > >> > > all values to be required.
> > > >> > >
> > > >> > > I think the plan you have in mind is quite reasonable.
> > > >> > > I don't have other suggestions right now.
> > > >> > >
> > > >> > > Thanks,
> > > >> > >
> > > >> > > --
> > > >> > > Gianmarco
> > > >> > >
> > > > >> > > On 21 October 2015 at 11:39, Jayadeep J <ja...@gmail.com> wrote:
> > > >> > >
> > > >> > >> Hi Gianmarco,
> > > >> > >>
> > > >> > >> Thanks for your reply. Regarding the points you mentioned,
> > > >> > >>
> > > > >> > >> 1) W.r.t. Sparse & Dense instances, I am trying to understand what you
> > > > >> > >> meant by "prototypes". Did you mean creating custom Avro data types
> > > > >> > >> like 'SparseNumeric', 'SparseNominal', 'DenseInstance', etc.? If yes,
> > > > >> > >> the actual data stored in the file (JSON encoded) may become heavy.
> > > > >> > >> For example, for the iris data-set, if we decide to use a
> > > > >> > >> 'SparseNumeric' type for 'sepallength',
> > > > >> > >>
> > > > >> > >> {"name": "sepallength","type":["null",{"name":"SparseNumeric","type":"record","fields":[{"name":"field","type":["null","int","double","long"]}]}]},
> > > > >> > >>
> > > > >> > >> the data may look like this,
> > > > >> > >>
> > > > >> > >> {"sepallength":null,"sepalwidth":3.5,"petallength":1.4,"petalwidth":0.2,"class":"setosa"}
> > > > >> > >>
> > > > >> > >> {"sepallength":{"com.yahoo.labs.samoa.avro.iris.SparseNumeric":{"field":{"double":4.7}}},"sepalwidth":1.4,"petallength":4.9,"petalwidth":0.2,"class":"virginica"}
> > > >> > >>
> > > > >> > >> Converting existing Avro data into 'SAMOA compatible Avro' may become
> > > > >> > >> painful for a user. Wouldn't it be easier if we just distinguish it
> > > > >> > >> inside the code? Say, if at least one attribute in the metadata uses
> > > > >> > >> the generic Avro optionality (e.g. ["null", "double"]), then we do
> > > > >> > >> readInstanceSparse() in the Loader and map correspondingly. Or is
> > > > >> > >> there some other complexity that I have not looked at?
> > > >> > >>
> > > > >> > >> 2) Yes. Skipping the Date-type attributes will make it easier!
> > > > >> > >>
> > > > >> > >> Regarding the engineering aspects,
> > > > >> > >>
> > > > >> > >> We can have the Avro dependency in the deployable jar of SAMOA. In the
> > > > >> > >> code, maybe
> > > > >> > >>
> > > > >> > >> 1) We could have an Avro equivalent of ArffFileStream.java & ArffLoader
> > > > >> > >> 2) Maybe a different Reader altogether for handling the binary stream
> > > > >> > >> 3) A user option to switch between JSON/Binary encoding
> > > > >> > >>
> > > > >> > >> If there is a better way to do it, kindly advise.
> > > >> > >>
> > > >> > >> Thanks
> > > >> > >> Jay
> > > >> > >> https://github.com/jayadeepj
> > > >> > >>
> > > > >> > >> On Tue, Oct 20, 2015 at 12:57 PM, Gianmarco De Francisci Morales <
> > > > >> > >> gdfm@apache.org> wrote:
> > > >> > >>
> > > >> > >>> Hi Jayadeep,
> > > >> > >>>
> > > >> > >>> I think it's pretty cool!
> > > > >> > >>> If we get both Avro and Kafka support right, we can connect to almost
> > > > >> > >>> anything.
> > > > >> > >>>
> > > > >> > >>> The document looks very comprehensive, you seem to have given a lot of
> > > > >> > >>> thought to it.
> > > > >> > >>> I am not extremely familiar with Avro myself, I've just used it a
> > > > >> > >>> couple of times, but I'll try to provide some suggestions.
> > > > >> > >>>
> > > > >> > >>> - The general idea of where and how to store data and meta-data seems
> > > > >> > >>> right.
> > > > >> > >>> - In general, all attributes in a sparse instance are optional, and all
> > > > >> > >>> attributes in a dense instance are required. Maybe we want to be more
> > > > >> > >>> granular than this in the future, but it seems that Avro supports a
> > > > >> > >>> superset of these settings. We may want to have some default
> > > > >> > >>> "prototypes" in order to make mapping the current dense/sparse
> > > > >> > >>> instances easy.
> > > > >> > >>> - Right now we are not making use of Date-type attributes in SAMOA
> > > > >> > >>> (there is no such thing in samoa-instances), so if it makes it easier
> > > > >> > >>> we could skip supporting it. Ideally we could have algorithms that
> > > > >> > >>> respect event-time as provided by timestamps in the instances (as
> > > > >> > >>> opposed to processing the event whenever it arrives), however we are
> > > > >> > >>> not there yet :)
> > > > >> > >>>
> > > > >> > >>> All the rest seems pretty straightforward.
> > > > >> > >>>
> > > > >> > >>> Moving to the more software-engineering oriented aspects, where would
> > > > >> > >>> we have dependencies for Avro? And how should they be deployed? Would
> > > > >> > >>> they simply go inside the deployable uber-jar of SAMOA?
> > > >> > >>>
> > > >> > >>> Thanks,
> > > >> > >>>
> > > >> > >>> --
> > > >> > >>> Gianmarco
> > > >> > >>>
> > > > >> > >>> On 19 October 2015 at 11:24, Jayadeep J <ja...@gmail.com> wrote:
> > > >> > >>>
> > > >> > >>>> Hi Gianmarco / All,
> > > >> > >>>>
> > > > >> > >>>> I am working on an integration of SAMOA with Apache Avro. Basically, I
> > > > >> > >>>> want data stored in Avro files to be usable as input to SAMOA.
> > > > >> > >>>>
> > > > >> > >>>> As I understand, current SAMOA readers only support the ARFF format.
> > > > >> > >>>> Do you think such a feature would be useful to SAMOA in general? Avro
> > > > >> > >>>> allows two encodings for the data: Binary & JSON. Hence Avro support
> > > > >> > >>>> may also allow users with JSON data to use SAMOA.
> > > > >> > >>>>
> > > > >> > >>>> Based on the input given by @gdfm to @ctippur, I have prepared an
> > > > >> > >>>> Input Format document in Google Docs.
> > > > >> > >>>>
> > > > >> > >>>> https://docs.google.com/document/d/1EiyuXOZFKk7MTs-gWaEJq5PVHYyiphhateTaDJMKuR8/edit?usp=sharing
> > > > >> > >>>>
> > > > >> > >>>> Would it be possible for you to have a look and provide your valuable
> > > > >> > >>>> suggestions? Thanks
> > > >> > >>>>
> > > >> > >>>>
> > > >> > >>>> Thanks
> > > >> > >>>> Jay
> > > >> > >>>> https://github.com/jayadeepj
> > > >> > >>>>
> > > >> > >>>
> > > >> > >>>
> > > >> > >>
> > > >> > >>
> > > >> > >> --
> > > >> > >> Thanks
> > > >> > >> Jay
> > > >> > >>
> > > >> > >>
> > > >>
> > > >
> > >
> >
>
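
The rule discussed in the thread (treat the data as sparse if at least one
attribute in the Avro schema is a nullable union such as ["null", "double"],
otherwise dense) can be sketched in a few lines. What follows is only an
illustration in Python using the standard json module, not the AvroLoader
code from the pull request. The schema, the sample record and the function
names (is_nullable, instance_kind, unwrap, to_sparse) are hypothetical. It
assumes Avro's JSON encoding, in which a non-null union value is wrapped as
{"type": value}, and it simply skips attributes whose value is null.

```python
import json

# Hypothetical iris schema: two attributes use the generic Avro optionality.
IRIS_SCHEMA = json.loads("""
{
  "type": "record",
  "name": "Iris",
  "fields": [
    {"name": "sepallength", "type": ["null", "double"]},
    {"name": "sepalwidth",  "type": ["null", "double"]},
    {"name": "petallength", "type": "double"},
    {"name": "petalwidth",  "type": "double"},
    {"name": "class", "type": {"type": "enum", "name": "Class",
                               "symbols": ["setosa", "versicolor", "virginica"]}}
  ]
}
""")


def is_nullable(field_type):
    """True if the Avro field type is a union that includes "null"."""
    return isinstance(field_type, list) and "null" in field_type


def instance_kind(schema):
    """Rule from the thread: any nullable attribute means sparse, else dense."""
    if any(is_nullable(field["type"]) for field in schema["fields"]):
        return "sparse"
    return "dense"


def unwrap(value):
    """Undo the JSON union encoding: None stays None, {"double": 4.7} -> 4.7."""
    if isinstance(value, dict) and len(value) == 1:
        return next(iter(value.values()))
    return value


def to_sparse(schema, record):
    """Return (attribute index, value) pairs for the non-null attributes."""
    pairs = []
    for index, field in enumerate(schema["fields"]):
        value = record.get(field["name"])
        if is_nullable(field["type"]):
            # Only union-typed fields carry the {"type": value} wrapper.
            value = unwrap(value)
        if value is not None:
            pairs.append((index, value))
    return pairs


if __name__ == "__main__":
    line = ('{"sepallength": null, "sepalwidth": {"double": 3.5}, '
            '"petallength": 1.4, "petalwidth": 0.2, "class": "setosa"}')
    print(instance_kind(IRIS_SCHEMA))
    print(to_sparse(IRIS_SCHEMA, json.loads(line)))
```

With this approach an existing Avro file needs no SAMOA-specific record
types: the choice between sparse and dense is made once from the schema, and
the per-record work is limited to unwrapping union values.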