Posted to dev@flink.apache.org by Robert Metzger <rm...@apache.org> on 2014/08/26 11:29:20 UTC

Fwd: [stratosphere-dev] Re: Project for GSoC

Hi Anirvan,

I'm forwarding this message to dev@flink.incubator.apache.org. You need to
send an (empty) message to dev-subscribe@flink.incubator.apache.org to
subscribe to the dev list.
The dev@ list is for discussions with the developers, planning, etc. The
user@flink.incubator.apache.org list is for user questions (for example,
trouble using the API, conceptual questions, etc.).
I think the message below is better suited for the dev@ list, since it's
basically a feature request.

Regarding the names: We don't use Stratosphere anymore. Our codebase has
been renamed to Flink and the "org.apache.flink" namespace, so hopefully
the naming confusion will finally be resolved.

For those who want to have a look into the history of the message, see the
Google Groups archive here:
https://groups.google.com/forum/#!topic/stratosphere-dev/qYvJRSoMYWQ

---------- Forwarded message ----------
From: Nirvanesque <ni...@gmail.com>
Date: Tue, Aug 26, 2014 at 11:12 AM
Subject: [stratosphere-dev] Re: Project for GSoC
To: stratosphere-dev@googlegroups.com
Cc: tsekis79@gmail.com


Hello Artem and mentors,

First of all, warm greetings from INRIA, France.
I hope you had an enjoyable experience in GSoC!
Thanks to Robert (rmetzger) for forwarding me here ...

At INRIA, we are starting to adopt Stratosphere / Flink.
The top-level goal is to enhance the performance of User-Defined Functions
(UDFs) in long workflows composed of multiple M-R stages, by using the
larger set of Second-Order Functions (SOFs) in Stratosphere / Flink.
We will demonstrate this improvement by implementing some use cases for
business purposes.
For this purpose, we have chosen some customer-analysis use cases using
weblogs and related data, for two companies that appeared interested in
trying Stratosphere / Flink:
- a mobile phone app developer: http://www.tribeflame.com
- an anti-virus & Internet security software company: www.f-secure.com
I will be happy to share with you these Use Cases, if you are interested.
Just ask me here.

At present, we are typically in the profiles of Alice-Bob-Sam, as described
in your GSoC proposal
<https://github.com/stratosphere/stratosphere/wiki/GSoC-2014-Project-Proposal-Draft-by-Artem-Tsikiridis>.
:-)
Hadoop seems to be the starting square for the Stratosphere / Flink journey.
Same is the situation with developers in the above 2 companies :-)

Briefly,
We have installed and run some example programs from Flink / Stratosphere
(versions 0.5.2 and 0.6). We use a cluster (Grid'5000) for our Hadoop &
Stratosphere installations.
We have a good understanding of Hadoop and its use with Streaming and
Pipes in conjunction with scripting languages (Python & R specifically).
In the first phase, we would like to run some "Hadoop-like" jobs (mainly
multiple M-R workflows) on Stratosphere, preferably without extensive Java
or Scala programming.
I refer to your GSoC project map
<https://github.com/stratosphere/stratosphere/wiki/%5BGSoC-14%5D-A-Hadoop-abstraction-layer-for-Stratosphere-%28Project-Map-and-Notes%29>
which seems very interesting.
If we could have a Hadoop abstraction as you have mentioned, that would be
ideal for our first phase.
In later phases, when we implement complex join and group operations, we
would dive deeper into the Stratosphere / Flink Java or Scala APIs.
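As a toy illustration of the idea (plain Python with invented names, not Flink code): with second-order functions that accept UDFs directly, a follow-up M-R stage becomes just one more chained operator instead of a second job:

```python
from collections import defaultdict

# Toy "second-order functions": each takes a user-defined function (UDF)
# and applies it over a dataset, mirroring how dataflow operators take UDFs.

def flat_map(udf, data):
    for record in data:
        yield from udf(record)

def group_reduce(udf, pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    for key, values in groups.items():
        yield from udf(key, values)

# Stage 1: word count (the classic map + reduce).
lines = ["to be or not to be", "to thine own self be true"]
counts = group_reduce(
    lambda word, ones: [(word, sum(ones))],
    flat_map(lambda line: [(w, 1) for w in line.split()], lines),
)

# Stage 2: keep only frequent words. In plain Hadoop this would be a
# second M-R job; here it is just one more chained operator.
frequent = [(w, c) for w, c in counts if c >= 2]

print(sorted(frequent))  # [('be', 3), ('to', 3)]
```

In real Stratosphere / Flink programs the chaining happens on DataSet operators instead of Python generators; the point is only that no job boundary is needed between the stages.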

Hence, I would like to know: what is the current status in this direction?
What has been implemented already, and from which version onwards? How can
we try it?
What is yet to be implemented, and when - in which versions?

You may also like to see my discussion with Robert on this page
<http://flink.incubator.apache.org/docs/0.6-incubating/cli.html#comment-1558297261>.

I am still mining through different discussions - here as well as on JIRA.
Please do refer me to the relevant links, JIRA tickets, etc., if that saves
you the time of re-typing long replies.
It will also help us to understand the train of collective thinking in the
Stratosphere / Flink roadmap.

Thanks in advance,
Anirvan
PS: Apologies for mixing the old and new names (e.g. Flink / Stratosphere),
as I am not sure which name exactly to use currently.


On Tuesday, February 25, 2014 10:23:09 PM UTC+1, Artem Tsikiridis wrote:
>
> Hello Fabian,
>
> On Tuesday, February 25, 2014 11:20:10 AM UTC+2, fhu...@gmail.com wrote:
> > Hi Artem,
> >
> > thanks a lot for your interest in Stratosphere and participating in our
> > GSoC projects!
> >
> > As you know, Hadoop is the big elephant out there in the Big Data jungle
> > and widely adopted. Therefore, a Hadoop compatibility layer is a very
> > important feature for any large-scale data processing system.
> > Stratosphere builds on foundations of MapReduce but generalizes its
> concepts and provides a more efficient runtime.
>
> Great!
>
> > When you have a look at the Stratosphere WordCount example program, you
> > will see that the programming principles of Stratosphere and Hadoop
> > MapReduce are quite similar, although Stratosphere is not compatible with
> > the Hadoop interfaces.
>
> Yes, I've looked into the examples (WordCount, k-means). I also ran the big
> test job you have locally, and it seems to be OK.
>
> > With the proposed project we want to achieve that Hadoop MapReduce jobs
> > can be executed on Stratosphere without changing a line of code (if
> > possible).
> >
> > We have already some pieces for that in place. InputFormats are done (see
> > https://github.com/stratosphere/stratosphere/tree/master/stratosphere-addons/hadoop-compatibility),
> > OutputFormats are work in progress. The biggest missing piece is executing
> > Hadoop Map and Reduce tasks in Stratosphere. Hadoop provides quite a few
> > interfaces (e.g., overriding the partitioning function and sorting
> > comparators, counters, the distributed cache, ...). It would of course be
> > desirable to support as many of these interfaces as possible, but they can
> > be added step-by-step once the first Hadoop jobs are running on
> > Stratosphere.
>
> So if I understand correctly, the idea is to create logical wrappers for
> all interfaces used by Hadoop jobs (the way it has been done with the
> Hadoop datatypes) so they can run as transparently as possible on
> Stratosphere in an efficient way. I agree, there are many interfaces, but
> it's very interesting considering the way Stratosphere defines tasks,
> which is a bit different (though, as you said, the principle is similar).
>
> I assume the focus is on the YARN version of Hadoop (new API)?
>
> And one last question: serialization for Stratosphere is Java's default
> mechanism, right?
>
> >
> > Regarding your question about cloud deployment scripts, one of our team
> > members is currently working on this (see this thread:
> > https://groups.google.com/forum/#!topic/stratosphere-dev/QZPYu9fpjMo).
> > I am not sure if this is still in the making or already done. If you are
> > interested in this as well, just drop a line to the thread. Although I am
> > not very familiar with the details of this, my gut feeling is that it
> > would be a bit too little for an individual project. However, there might
> > be ways to extend this. So if you have any ideas, share them with us and
> > we will be happy to discuss them.
>
> Thank you for pointing out the topic. I will let you know if I come up with
> anything for this - probably after I try deploying it on OpenStack.
>
> >
> > Again, thanks a lot for your interest and please don't hesitate to ask
> > questions. :-)
>
> Thank you for the helpful answers.
>
> Kind regards,
> Artem
>
>
> >
> > Best,
> > Fabian
> >
> >
> > On Tuesday, February 25, 2014 9:12:10 AM UTC+1, tsek...@gmail.com wrote:
> > Dear Stratosphere devs and fellow GSoC potential students,
> > Hello!
> > I'm Artem, an undergraduate student from Athens, Greece. You can find me
> > on GitHub (https://github.com/atsikiridis) and occasionally on Stack
> > Overflow (http://stackoverflow.com/users/2568511/artem-tsikiridis).
> > Currently, however, I'm in Switzerland, where I am doing my internship at
> > CERN as a back-end software developer for INSPIRE, a library for High
> > Energy Physics (we're running on http://inspirehep.net/). The service is
> > in Python (based on the open-source project http://invenio.net) and my
> > responsibilities are mostly the integration with Redis, database
> > abstractions, testing (unit, regression), and helping our team integrate
> > modern technologies and frameworks into the current code base.
> > Moreover, I am very interested in big data technologies; therefore,
> > before coming to CERN I made my first steps in research at the Big Data
> > lab of AUEB, my home university. The main objective of the project I was
> > involved with was the implementation of a dynamic caching mechanism for
> > Hadoop (in a way, trying our cache instead of the built-in distributed
> > cache). Other technologies involved were Redis, Memcached, and Ehcache
> > (Terracotta). With this project we gained some insights into the
> > internals of Hadoop (new API, old API, how tasks work, Hadoop
> > serialization, the daemons running, etc.) and HDFS, and deployed clusters
> > on cloud computing platforms (OpenStack with Nova, Amazon EC2 with boto).
> > We also used the Java Remote API for some tests.
> > Unfortunately, I have not used Stratosphere before in a research or
> > production environment. I have only played with the examples on my local
> > machine. It is very interesting and I would love to learn more.
> > There will probably be a learning curve for me on the Stratosphere side,
> > but implementing a Hadoop Compatibility Layer seems like a very
> > interesting project and I believe I can be of use :)
> > Finally, I was wondering whether there are some command-line tools for
> > deploying Stratosphere automatically on EC2 or OpenStack clouds (for
> > example, Stratosphere-specific abstractions on top of the Python boto
> > API). Do you think that would make sense as a project?
> > Pardon me for the length of this.
> > Kind regards,
> > Artem

 --
You received this message because you are subscribed to the Google Groups
"stratosphere-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to stratosphere-dev+unsubscribe@googlegroups.com.
Visit this group at http://groups.google.com/group/stratosphere-dev.
For more options, visit https://groups.google.com/d/optout.

Re: [stratosphere-dev] Re: Project for GSoC

Posted by Kostas Tzoumas <kt...@apache.org>.
Hi Anirvan,

Can you give us some more information about your Hadoop jobs? Is this a
pre-existing Hadoop workload that you want to run on Flink, or are you
writing the jobs from scratch?

Kostas


On Wed, Aug 27, 2014 at 5:08 PM, Fabian Hueske <fh...@apache.org> wrote:

> Hi Anirvan,
>
> welcome to the Flink mailing list.
>
> Your examples are based on Hadoop Streaming. I have to admit, I am not an
> expert in Hadoop Streaming, but from my understanding it is basically a
> wrapper to run UDFs implemented in a programming language other than Java.
> Flink does not support Hadoop Streaming at the moment. I don't know how it
> works under the hood, but it might be that Hadoop Streaming works once we
> have support for executing regular Hadoop jobs. However, I would not bet
> on or wait for that. There are efforts to add a Python interface for
> Flink, but this is still work in progress.
>
> I'd recommend getting started with the regular Java API.
> For starting with the Java API, I point you to the Java Quickstart on our
> website:
> -->
>
> http://flink.incubator.apache.org/docs/0.6-incubating/java_api_quickstart.html
>
> Let us know if you have further questions.
>
> Best, Fabian
>
>
>
>
> 2014-08-27 16:48 GMT+02:00 Anirvan BASU <an...@inria.fr>:
>
> > Hello Artem, Fabian and Robert,
> >
> > Thank you for your valuable inputs.
> >
> > Here is what we are trying to do currently (scratching the surface):
> >
> > Consider typical Hadoop commands of the form:
> > hadoop jar /opt/hadoop/contrib/streaming/hadoop-streaming-1.2.1.jar -D
> > stream.num.map.output.key.fields=2 -D
> > mapred.text.key.partitioner.options=-1,1 -inputformat
> > org.apache.hadoop.mapred.TextInputFormat -input stocks/input/stocks.txt
> > -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
> > Here we run a Hadoop job with some parameters and keys specified.
> >
> > hadoop jar /opt/hadoop/contrib/streaming/hadoop-streaming-1.2.1.jar
> > -input /ngrams/input -output /ngrams/output -mapper ./python/mapper.py
> > -file ./python/mapper.py -combiner ./python/reducer.py -reducer
> > ./python/reducer.py -file ./python/reducer.py -jobconf
> > stream.num.map.output.key.fields=3 -jobconf
> > stream.num.reduce.output.key.fields=3 -jobconf mapred.reduce.tasks=10
> > Here we run a Hadoop job with some parameters specified, using external
> > scripts for the mapper, combiner, and reducer (in Python - I cite this
> > example because you have developed a Python connector).
> >
> > The above are slightly more complicated than "hello world"-type word
> > count examples, aren't they?
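For context, the contract behind such streaming commands can be sketched framework-free: the mapper and reducer are ordinary programs that read and emit tab-separated key/value lines, with Hadoop sorting the mapper output in between. The minimal mapper/reducer below are hypothetical stand-ins, not the actual ./python scripts referenced above:

```python
# Minimal mapper/reducer in the Hadoop Streaming style: each consumes lines
# and emits "key<TAB>value" lines. Hadoop sorts the mapper output by key
# before the reducer sees it, so the reducer may assume grouped input.

def mapper(lines):
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    current, total = None, 0
    for line in sorted_lines:
        key, value = line.split("\t")
        if key != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = key, 0
        total += int(value)
    if current is not None:
        yield f"{current}\t{total}"

# Simulate the shuffle/sort that Hadoop performs between the two steps.
mapped = sorted(mapper(["b a", "a c a"]))
result = list(reducer(mapped))
print(result)  # ['a\t3', 'b\t1', 'c\t1']
```

A real Hadoop Streaming run wires these together via stdin/stdout across separate processes; the in-process simulation above only mimics the dataflow.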
> >
> > Hence,
> > We would like to know how to wrap and run the above jobs (if possible)
> > using Flink.
> > At which level of compatibility would you classify the above jobs?
> >
> > If possible, could you help us by:
> > - showing how to "wrap" such commands?
> > - listing what dependencies, sources, and classes we need (to reference
> > in the pom.xml if building a package, or to import if building a jar
> > file)?
> >
> > Honestly, after drowning in all the information, we are trying to
> > understand the (Flink) elephant like the blind men. We understand it has
> > legs, a trunk, etc. ... but don't understand where to start to mount and
> > ride it.
> >
> > Happy to clarify again & again until we reach success.
> >
> > Thank you for your understanding and all your kind help,
> > Anirvan
> >
> > ----- Original Message -----
> > > From: "Fabian Hueske" <fh...@apache.org>
> > > To: dev@flink.incubator.apache.org
> > > Sent: Wednesday, August 27, 2014 10:04:18 AM
> > > Subject: Re: [stratosphere-dev] Re: Project for GSoC
> > >
> > > Hi Anirvan,
> > >
> > > we have a JIRA issue that tracks the Hadoop compatibility feature:
> > > https://issues.apache.org/jira/browse/FLINK-838
> > > The basic mechanism is done, but has not been merged because we found a
> > > cleaner way to integrate the feature.
> > >
> > > There are different levels of Hadoop Compatibility and I did not fully
> > > understand which kind of compatibility you need.
> > >
> > > Conceptual Compatibility:
> > > Flink already offers UDF interfaces which are very similar to Hadoop's
> > > Map, Combine, and Reduce functions (FlatMap, FlatCombine, GroupReduce).
> > > These interfaces are not source-code compatible, but porting the code
> > > should be trivial. This is already there. If you're fine with porting
> > > your code from Hadoop to Flink (which should be a very small effort),
> > > you're good to go.
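As a rough illustration of this kind of port (plain Python pseudocode with invented names, rather than real Hadoop or Flink API code), the two UDF styles differ mainly in packaging, not in the function body:

```python
# "Hadoop style": a Mapper class whose map(key, value, ...) pushes results
# into a collector object.
class TokenizerMapper:
    def map(self, key, value, collector):
        for word in value.split():
            collector.append((word, 1))

# "Flink style": a flat-map UDF taking one record plus a collector.
# The body is carried over almost verbatim, which is why the port is small.
def tokenize(value, out):
    for word in value.split():
        out.append((word, 1))

hadoop_out, flink_out = [], []
TokenizerMapper().map(None, "hello hello world", hadoop_out)
tokenize("hello hello world", flink_out)
assert hadoop_out == flink_out == [("hello", 1), ("hello", 1), ("world", 1)]
```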
> > >
> > > Function-Level-Source Compatibility:
> > > Providing UDF wrappers to use Hadoop Map and Reduce functions in Flink
> > > programs is not difficult (since they are very similar to their Flink
> > > versions). This is not yet included in the codebase but could be added
> > > at rather short notice. We do already have interface compatibility for
> > > Hadoop Input- and OutputFormats. So you can use these without changing
> > > the code.
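A sketch of the wrapper idea (again plain Python with invented names, not the real Flink hadoop-compatibility classes): adapt the legacy interface once, and unmodified Hadoop-style functions become usable wherever a flat-map UDF is expected:

```python
# Pre-existing "Hadoop-style" mapper, left completely untouched.
class LegacyTokenizer:
    def map(self, key, value, collector):
        for word in value.split():
            collector.append((word.lower(), 1))

def wrap_mapper(mapper):
    """Adapt mapper.map(key, value, collector) to flatmap(record) -> list."""
    def flatmap(record):
        key, value = record
        collected = []
        mapper.map(key, value, collected)
        return collected
    return flatmap

udf = wrap_mapper(LegacyTokenizer())
print(udf((0, "Flink WordCount")))  # [('flink', 1), ('wordcount', 1)]
```

One generic adapter per interface (mapper, reducer, partitioner, ...) is what makes this level of compatibility cheap compared to re-running whole jobs unchanged.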
> > >
> > > Job-Level-Source Compatibility:
> > > This is what we aimed for with the GSoC project. Here we want to
> > > support the execution of a full Hadoop MR job in Flink without changing
> > > the code. However, this turned out to be a bit tricky if custom
> > > partitioners, sort comparators, and grouping comparators are used in
> > > the job. Adding this feature will take a bit more time.
> > >
> > > Function- and Job-Level-Source compatibility will enable the use of
> > > already existing Hadoop code. If you are implementing new analysis jobs
> > > anyway, I'd go for a Flink implementation, which eases many things such
> > > as secondary sort, unions of multiple inputs, etc.
> > >
> > > Cheers,
> > > Fabian

Re: [stratosphere-dev] Re: Project for GSoC

Posted by Fabian Hueske <fh...@apache.org>.
Hi Anirvan,

welcome on the Flink mailing list.

Your examples are based on Hadoop-Streaming. I have to admit, I am not an
expert in Hadoop-Streaming, but from my understanding it is basically a
wrapper to run UDFs implement in a programming language other than Java.
Flink does not support Hadoop Streaming at the moment. I don't know how it
works under the hood but it might be that Hadoop-Streaming works once we
have support to execute regular Hadoop jobs. However, I would not bet on or
wait for that. There are efforts to add a Python interface for Flink, but
this is still work in progress.

I'd recommend to get started with the regular Java API.
For starting with the Java API, I point you to the Java Quickstart on our
website:
-->
http://flink.incubator.apache.org/docs/0.6-incubating/java_api_quickstart.html

Let us know if you have further questions.

Best, Fabian




2014-08-27 16:48 GMT+02:00 Anirvan BASU <an...@inria.fr>:

> Hello Artem, Fabian and Robert,
>
> Thank you for your valuable inputs.
>
> Here is what we are trying to do currently (scratching the surface):
>
> Consider typical Hadoop commands of the form:
> hadoop jar /opt/hadoop/contrib/streaming/hadoop-streaming-1.2.1.jar -D
> stream.num.map.output.key.fields=2 -D
> mapred.text.key.partitioner.options=-1,1 -inputformat
> org.apache.hadoop.mapred.TextInputFormat -input stocks/input/stocks.txt
> -apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
> Here we run a hadoop job with some parameters and keys specified.
>
> hadoop jar /opt/hadoop/contrib/streaming/hadoop-streaming-1.2.1.jar -input
> /ngrams/input -output /ngrams/output -mapper ./python/mapper.py -file
> ./python/mapper.py -combiner ./python/reducer.py -reducer
> ./python/reducer.py -file ./python/reducer.py -jobconf
> stream.num.map.output.key.fields=3 -jobconf
> stream.num.reduce.output.key.fields=3 -jobconf mapred.reduce.tasks=10
> Here we run a hadoop job with some parameters specified and using external
> scripts (in Python - as you have developed a Python connector, I cite this
> example) for mapper, combiner, reducer.
>
> The above are slightly complicated than "hello world" type wordcount
> examples, ain't it?
>
> Hence,
> We would like to know how to wrap and run the above jobs (if possible)
> using flink.
> In which level of compatibility would you classify the above jobs?
>
> If possible, could you help us by:
> - showing how to "wrap" around such commands?
> - what dependencies, sources, classes (to archetype in the pom.xml if
> required to make a package) OR (to import if required to make a jar file)?
>
> Honestly, after drowning in all the information, we are trying to
> understand the (flink) elephant as blind mice. We understand it has legs,
> trunk, etc ... but don't understand where to start to mount and ride it.
>
> Happy to clarify again & again until we reach success.
>
> Thank you for your understanding and all your kind help,
> Anirvan
>
> ----- Original Message -----
> > From: "Fabian Hueske" <fh...@apache.org>
> > To: dev@flink.incubator.apache.org
> > Sent: Wednesday, August 27, 2014 10:04:18 AM
> > Subject: Re: [stratosphere-dev] Re: Project for GSoC
> >
> > Hi Anirvan,
> >
> > we have a JIRA that tracks the HadoopCompatibility feature:
> > https://issues.apache.org/jira/browse/FLINK-838
> > The basic mechanism is done, but has not been merged because we found a
> > cleaner way to integrate the feature.
> >
> > There are different levels of Hadoop Compatibility and I did not fully
> > understand which kind of compatibility you need.
> >
> > Conceptual Compatibility:
> > Flink already offers UDF interfaces which are very similar to Hadoop's
> Map,
> > Combine, and Reduce functions (FlatMap, FlatCombine, GroupReduce). These
> > interfaces are not source-code compatible, but porting the code should be
> > trivial. This is already there. If you're fine with porting your code
> from
> > Hadoop to Flink (which should be a very small effort) you're good to go.
> >
> > Function-Level-Source Compatibility:
> > Providing UDF wrappers to use Hadoop Map and Reduce functions in Flink
> > programs is not difficult (since they are very similar to their Flink
> > versions). This is not yet included in the codebase but could be added at
> > rather short notice. We do already have interface compatibility for
> Hadoop
> > Input- and OutputFormats. So you can use these without changing the code.
> >
> > Job-Level-Source Compatibility:
> > This is what we aimed for with the GSoC project. Here we want to support
> > the execution of a full Hadoop MR Job in Flink without changing the code.
> > However, this turned out to be a bit tricky if custom partitioners, sort
> > comparators, and grouping comparators are used in the job. Adding this
> > feature will take a bit more time.
> >
> > Function- and Job-Level-Source compatibility will enable the use of
> already
> > existing Hadoop code. If you are implementing new analysis jobs anyways,
> > I'd go for a Flink implementation which eases many things such as
> secondary
> > sort, unions of multiple inputs, etc.
> >
> > Cheers,
> > Fabian
> >
> >
> >
> >
> > 2014-08-26 11:29 GMT+02:00 Robert Metzger <rm...@apache.org>:
> >
> > > Hi Anirvan,
> > >
> > > I'm forwarding this message to dev@flink.incubator.apache.org. You
> need to
> > > send a (empty) message to dev-subscribe@flink.incubator.apache.org to
> > > subscribe to the dev list.
> > > The dev@ list is for discussions with the developers, planning etc.
> The
> > > user@flink.incubator.apache.org list is for user questions (for
> example
> > > troubles using the API, conceptual questions etc.)
> > > I think the message below is more suited for the dev@ list, since it's
> > > basically a feature request.
> > >
> > > Regarding the names: We don't use Stratosphere anymore. Our codebase
> has
> > > been renamed to Flink and the "org.apache.flink" namespace. So ideally
> > > this naming confusion is finally resolved.
> > >
> > > For those who want to have a look into the history of the message, see
> the
> > > Google Groups archive here:
> > > https://groups.google.com/forum/#!topic/stratosphere-dev/qYvJRSoMYWQ
> > >
> > > ---------- Forwarded message ----------
> > > From: Nirvanesque <ni...@gmail.com>
> > > Date: Tue, Aug 26, 2014 at 11:12 AM
> > > Subject: [stratosphere-dev] Re: Project for GSoC
> > > To: stratosphere-dev@googlegroups.com
> > > Cc: tsekis79@gmail.com
> > >
> > >
> > > Hello Artem and mentors,
> > >
> > > First of all nice greetings from INRIA, France.
> > > Hope you had an enjoyable experience in GSOC!
> > > Thanks to Robert (rmetzger) for forwarding me here ...
> > >
> > > At INRIA, we are starting to adopt Stratosphere / Flink.
> > > The top-level goal is to enhance performance in User Defined Functions
> > > (UDFs) with long workflows using multiple M-R, by using the larger set
> of
> > > Second Order Functions (SOFs) in Stratosphere / Flink.
> > > We will demonstrate this improvement by implementing some Use Cases for
> > > business purposes.
> > > For this purpose, we have chosen some customer analysis Use Cases using
> > > weblogs and related data, for 2 companies (who appeared interested to
> try
> > > using Stratosphere / Flink )
> > > - a mobile phone app developer: http://www.tribeflame.com
> > > - an anti-virus & Internet security software company: www.f-secure.com
> > > I will be happy to share with you these Use Cases, if you are
> interested.
> > > Just ask me here.
> > >
> > > At present, we are typically in the profiles of Alice-Bob-Sam, as
> described
> > > in your GSoC proposal
> > > <
> > >
> https://github.com/stratosphere/stratosphere/wiki/GSoC-2014-Project-Proposal-Draft-by-Artem-Tsikiridis
> > > >.
> > > :-)
> > > Hadoop seems to be the starting square for the Stratosphere / Flink
> > > journey.
> > > Same is the situation with developers in the above 2 companies :-)
> > >
> > > Briefly,
> > > We have installed and run some example programmes from Flink /
> Stratosphere
> > > (versions 0.5.2 and 0.6). We use a cluster (the grid5000 for our
> Hadoop &
> > > Stratosphere installations)
> > > We have some good understanding of Hadoop and its use in Streaming and
> > > Pipes in conjunction with scripting languages (Python & R specifically)
> > > In the first phase, we would like to run some "Hadoop-like" jobs (mainly
> > > multiple M-R workflows) on Stratosphere, preferably without extensive
> > > Java or Scala programming.
> > > I refer to your GSoC project map
> > > <
> > >
> https://github.com/stratosphere/stratosphere/wiki/%5BGSoC-14%5D-A-Hadoop-abstraction-layer-for-Stratosphere-%28Project-Map-and-Notes%29
> > > >
> > > which seems very interesting.
> > > If we could have a Hadoop abstraction as you have mentioned, that
> would be
> > > ideal for our first phase.
> > > In later phases, when we implement complex join and group operations,
> we
> > > would dive deeper into Stratosphere / Flink Java or Scala APIs
> > >
> > > Hence, I would like to know, what is the current status in this
> direction?
> > > What has been implemented already? In which version onwards? How to try
> > > them?
> > > What is yet to be implemented? When - which versions?
> > >
> > > You may also like to see my discussion with Robert on this page
> > > <
> > >
> http://flink.incubator.apache.org/docs/0.6-incubating/cli.html#comment-1558297261
> > > >.
> > >
> > > I am still mining in different discussions - here as well as on JIRA.
> > > Please do refer me to the relevant links, JIRA tickets, etc if that
> saves
> > > your time in re-typing large replies.
> > > It will also help us to understand the train of collective thinking in
> the
> > > Stratosphere / Flink roadmap.
> > >
> > > Thanks in advance,
> > > Anirvan
> > > PS : Apologies for using names / rechristened names (e.g. Flink /
> > > Stratosphere) as I am not sure, which name exactly to use currently.
> > >
> > >
> > > On Tuesday, February 25, 2014 10:23:09 PM UTC+1, Artem Tsikiridis
> wrote:
> > > >
> > > > Hello Fabian,
> > > >
> > > > On Tuesday, February 25, 2014 11:20:10 AM UTC+2, fhu...@gmail.com
> wrote:
> > > > > Hi Artem,
> > > > >
> > > > > thanks a lot for your interest in Stratosphere and participating
> in our
> > > > GSoC projects!
> > > > >
> > > > > As you know, Hadoop is the big elephant out there in the Big Data
> > > jungle
> > > > and widely adopted. Therefore, a Hadoop compatibility layer is a very
> > > > important feature for any large-scale data processing system.
> > > > > Stratosphere builds on foundations of MapReduce but generalizes its
> > > > concepts and provides a more efficient runtime.
> > > >
> > > > Great!
> > > >
> > > > > When you have a look at the Stratosphere WordCount example
> program, you
> > > > will see, that the programming principles of Stratosphere and Hadoop
> > > > MapReduce are quite similar, although Stratosphere is not compatible
> with
> > > > the Hadoop interfaces.
> > > >
> > > > Yes, I've looked into the examples (WordCount, k-means). I also ran
> > > > the big test job you have locally and it seems to be OK.
> > > >
> > > > > With the proposed project we want to achieve, that Hadoop MapReduce
> > > jobs
> > > > can be executed on Stratosphere without changing a line of code (if
> > > > possible).
> > > > >
> > > > > We have already some pieces for that in place. InputFormats are
> done
> > > > (see https://github.com/stratosphere/stratosphere/
> > > > tree/master/stratosphere-addons/hadoop-compatibility), OutputFormats
> are
> > > > work in progress. The biggest missing piece is executing Hadoop Map
> and
> > > > Reduce tasks in Stratosphere. Hadoop provides quite a few interfaces
> > > (e.g.,
> > > > overwriting partitioning function and sorting comparators, counters,
> > > > distributed cache, ...). It would of course be desirable to support
> as
> > > many
> > > > of these interfaces as possible, but they can be added step-by-step
> once
> > > > the first Hadoop jobs are running on Stratosphere.
> > > >
> > > > So if I understand correctly, the idea is to create logical wrappers for
> > > > all interfaces used by Hadoop jobs (the way it has been done with the
> > > > Hadoop datatypes) so they can be run as transparently as possible
> > > > on Stratosphere in an efficient way. I agree, there are many
> interfaces,
> > > > but it's very interesting considering the way Stratosphere defines
> tasks,
> > > > which is a bit different (though, as you said, the principle is
> similar).
> > > >
> > > > I assume the focus is on the YARN version of Hadoop (new api)?
> > > >
> > > > And one last question, serialization for Stratosphere is java's
> default
> > > > mechanism, right?
> > > >
> > > > >
> > > > > Regarding your question about cloud deployment scripts, one of our
> team
> > > > members is currently working on this (see this thread:
> > > > https://groups.google.com/forum/#!topic/stratosphere-dev/QZPYu9fpjMo
> ).
> > > > > I am not sure, if this is still in the making or already done. If
> you
> > > > are interested in this as well, just drop a line to the thread.
> > > Although, I
> > > > am not very familiar with the details of this, my gut feeling is that this
> > > > would be a bit too little for an individual project. However, there
> might
> > > be
> > > > ways to extend this. So if you have any ideas, share them with us
> and we
> > > > will be happy to discuss them.
> > > >
> > > > Thank you for pointing out the topic. I will let you know if I come up
> > > with
> > > > anything for this. Probably after I try deploying it on openstack.
> > > >
> > > > >
> > > > > Again, thanks a lot for your interest and please don't hesitate to
> ask
> > > > questions. :-)
> > > >
> > > > Thank you for the helpful answers.
> > > >
> > > > Kind regards,
> > > > Artem
> > > >
> > > >
> > > > >
> > > > > Best,
> > > > > Fabian
> > > > >
> > > > >
> > > > > On Tuesday, February 25, 2014 9:12:10 AM UTC+1, tsek...@gmail.com
> > > > wrote:
> > > > > Dear Stratosphere devs and fellow GSoC potential students,
> > > > > Hello!
> > > > > I'm Artem, an undergraduate student from Athens, Greece. You can
> find
> > > me
> > > > on github (https://github.com/atsikiridis) and occasionally on
> > > > stackoverflow (
> http://stackoverflow.com/users/2568511/artem-tsikiridis).
> > > > Currently, however, I'm in Switzerland where I am doing my
> internship at
> > > > CERN as back-end software developer for INSPIRE, a library for High
> > > Energy
> > > > Physics (we're running on http://inspirehep.net/). The service is in
> > > > Python (based on the open-source project http://invenio.net) and my
> > > > responsibilities are mostly the integration with Redis, database
> > > > abstractions, testing (unit, regression) and helping
> > > > > our team to integrate modern technologies and frameworks to the
> current
> > > > code base.
> > > > > Moreover, I am very interested in big data technologies, therefore
> > > > before coming to CERN I've been trying  to make my first steps in
> > > research
> > > > at the Big Data lab of AUEB, my home university. Mostly, the main
> > > objective
> > > > of the project I had been involved with, was the implementation of a
> > > > dynamic caching mechanism for Hadoop (in a way trying our cache
> instead
> > > of
> > > > the built-in distributed cache). Other techs involved were Redis,
> > > > Memcached, Ehcache (Terracotta). With this project we gained some
> > > insights
> > > > about the internals of Hadoop (new API, old API, how tasks work, Hadoop
> > > > serialization, the daemons running, etc.) and HDFS, deployed clusters
> on
> > > > cloud computing platforms (Openstack with Nova,  Amazon EC2 with
> boto).
> > > We
> > > > also used the Java Remote API for some tests.
> > > > > Unfortunately, I have not used Stratosphere before in a research
> /prod
> > > > environment. I have only played with the examples on my local
> machine. It
> > > > is very interesting and I would love to learn more.
> > > > > There will probably be a learning curve for me on the Stratosphere
> side
> > > > but implementing a Hadoop Compatibility Layer seems like a very
> > > interesting
> > > > project and I believe I can be of use :)
> > > > > Finally, I was wondering whether there are some command-line tools
> for
> > > > deploying Stratosphere automatically for EC2 or Openstack clouds (for
> > > > example, Stratosphere specific abstractions on top of python boto
> api).
> > > Do
> > > > you think that would make sense as a project?
> > > > > Pardon me for the length of this.
> > > > > Kind regards,
> > > > > Artem
> > >
> > >  --
> > > You received this message because you are subscribed to the Google
> Groups
> > > "stratosphere-dev" group.
> > > To unsubscribe from this group and stop receiving emails from it, send
> an
> > > email to stratosphere-dev+unsubscribe@googlegroups.com.
> > > Visit this group at http://groups.google.com/group/stratosphere-dev.
> > > For more options, visit https://groups.google.com/d/optout.
> > >
> >
>

Re: [stratosphere-dev] Re: Project for GSoC

Posted by Artem Tsikiridis <ts...@gmail.com>.
Hi Anirvan,

Thank you very much for your interest.

I would just like to add to what Fabian has said that, if you plan to use
Job-Level-Source Compatibility, we have so far focused on the mapred API
(there is also mapreduce). You should be able to use jobs that consist of
Mappers and Reducers and configure these tasks with a JobConf (meaning you can
adjust parallelism and use the DistributedCache) in the code that should be
merged at short notice. Here is an example:
https://github.com/atsikiridis/incubator-flink/blob/hadoop-docs/docs/hadoop_compatability.md
(it has not been merged yet)
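To give a feel for why the porting effort between the two APIs is small, here is a pure-Python sketch of the two per-record contracts. Both classes are hypothetical pseudo-interfaces for illustration only, not the real org.apache.hadoop.mapred or Flink classes:

```python
# Simulation of the two UDF shapes:
#   Hadoop mapred: Mapper.map(key, value, output)  -- collector passed in
#   Flink:         flatMap(value, out)             -- same idiom, no input key

class HadoopStyleTokenizer:
    def map(self, key, value, output):
        # mapred style: an input key is supplied (often ignored) and
        # results go to a collector object.
        for word in value.split():
            output.append((word, 1))

class FlinkStyleTokenizer:
    def flat_map(self, value, out):
        # Flink style: no input key, otherwise the same collector idiom.
        for word in value.split():
            out.append((word, 1))

def run_hadoop_map(udf, records):
    out = []
    for offset, rec in enumerate(records):  # Hadoop supplies e.g. byte offsets as keys
        udf.map(offset, rec, out)
    return out

def run_flink_flat_map(udf, records):
    out = []
    for rec in records:
        udf.flat_map(rec, out)
    return out

hadoop_out = run_hadoop_map(HadoopStyleTokenizer(), ["to be or not"])
flink_out = run_flink_flat_map(FlinkStyleTokenizer(), ["to be or not"])
assert hadoop_out == flink_out  # the UDF bodies are line-for-line identical
```

The UDF body is identical in both cases; only the method signature changes, which is what makes the conceptual port (and a mechanical wrapper) cheap.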

Cheers,
Artem



Re: [stratosphere-dev] Re: Project for GSoC

Posted by Anirvan BASU <an...@inria.fr>.
Hello Artem, Fabian and Robert,

Thank you for your valuable inputs.

Here is what we are trying to do currently (scratching the surface):

Consider typical Hadoop commands of the form:
hadoop jar /opt/hadoop/contrib/streaming/hadoop-streaming-1.2.1.jar -D stream.num.map.output.key.fields=2 -D mapred.text.key.partitioner.options=-k1,1 -inputformat org.apache.hadoop.mapred.TextInputFormat -input stocks/input/stocks.txt -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
Here we run a hadoop job with some parameters and keys specified.
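To unpack those options: stream.num.map.output.key.fields=2 makes the first two tab-separated fields of each map output line the key, while mapred.text.key.partitioner.options restricts KeyFieldBasedPartitioner to the first key field, so records that share field 1 meet in the same reducer. A rough pure-Python sketch of that behaviour (the stock records are made up for illustration; this is not Hadoop's actual implementation):

```python
# Sketch of KeyFieldBasedPartitioner semantics: the key is fields 1-2,
# but only field 1 decides which reducer a record goes to.

def partition(line, num_reducers, key_fields=2, part_fields=1):
    """Return the reducer index for one tab-separated map output line."""
    fields = line.split("\t")
    key = fields[:key_fields]                 # full key: first two fields
    part_key = "\t".join(key[:part_fields])   # partitioning key: field 1 only
    return hash(part_key) % num_reducers

# Lines sharing field 1 land on the same reducer even if field 2 differs:
a = partition("AAPL\t2014-08-26\t100.0", 10)
b = partition("AAPL\t2014-08-27\t101.5", 10)
assert a == b
```

This grouping-by-a-key-prefix is exactly the kind of custom partitioning Fabian mentions as the tricky part of job-level compatibility.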

hadoop jar /opt/hadoop/contrib/streaming/hadoop-streaming-1.2.1.jar -input /ngrams/input -output /ngrams/output -mapper ./python/mapper.py -file ./python/mapper.py -combiner ./python/reducer.py -reducer ./python/reducer.py -file ./python/reducer.py -jobconf stream.num.map.output.key.fields=3 -jobconf stream.num.reduce.output.key.fields=3 -jobconf mapred.reduce.tasks=10
Here we run a hadoop job with some parameters specified and using external scripts (in Python - as you have developed a Python connector, I cite this example) for mapper, combiner, reducer.
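For context, streaming mapper/combiner/reducer scripts of this kind communicate with Hadoop purely over stdin/stdout as tab-separated lines. A hypothetical minimal word-count pair, written as importable functions (for illustration only; not the actual mapper.py/reducer.py used in this job):

```python
from itertools import groupby

def mapper(lines):
    """Map phase: emit one 'word<TAB>1' line per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(lines):
    """Reduce phase: sum counts per word; Hadoop delivers input sorted by key."""
    pairs = (line.rsplit("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# In an actual streaming job each function would sit in its own script,
# reading sys.stdin line by line and print()-ing results, e.g.:
#   for out in mapper(sys.stdin): print(out)
sorted_map_output = sorted(mapper(["to be or not to be"]))
print(list(reducer(sorted_map_output)))  # ['be\t2', 'not\t1', 'or\t1', 'to\t2']
```

Note that -combiner ./python/reducer.py works in the command above precisely because a sum-style reducer can double as a combiner unchanged.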

The above are slightly more complicated than "hello world"-type wordcount examples, aren't they?

Hence, 
We would like to know how to wrap and run the above jobs (if possible) using Flink.
Under which level of compatibility would you classify the above jobs?

If possible, could you help us by:
- showing how to "wrap" around such commands?
- what dependencies, sources, classes (to archetype in the pom.xml if required to make a package) OR (to import if required to make a jar file)?

Honestly, after drowning in all the information, we are trying to understand the (flink) elephant as blind mice. We understand it has legs, trunk, etc ... but don't understand where to start to mount and ride it.

Happy to clarify again & again until we reach success.

Thank you for your understanding and all your kind help,
Anirvan

----- Original Message -----
> From: "Fabian Hueske" <fh...@apache.org>
> To: dev@flink.incubator.apache.org
> Sent: Wednesday, August 27, 2014 10:04:18 AM
> Subject: Re: [stratosphere-dev] Re: Project for GSoC
> 
> Hi Anirvan,
> 
> we have a JIRA that tracks the HadoopCompatibility feature:
> https://apache.org/jira/browse/FLINK-838
> <https://issues.apache.org/jira/browse/FLINK-838>
> The basic mechanism is done, but has not been merged because we found a
> cleaner way to integrate the feature.
> 
> There are different levels of Hadoop Compatibility and I did not fully
> understand which kind of compatibility you need.
> 
> Conceptual Compatibility:
> Flink already offers UDF interfaces which are very similar to Hadoop's Map,
> Combine, and Reduce functions (FlatMap, FlatCombine, GroupReduce). These
> interfaces are not source-code compatible, but porting the code should be
> trivial. This is already there. If you're fine with porting your code from
> Hadoop to Flink (which should be a very small effort) you're good to go.
> 
> Function-Level-Source Compatibility:
> Providing UDF wrappers to use Hadoop Map and Reduce functions in Flink
> programs is not difficult (since they are very similar to their Flink
> versions). This is not yet included in the codebase but could be added at
> rather short notice. We do already have interface compatibility for Hadoop
> Input- and OutputFormats. So you can use these without changing the code.
> 
> Job-Level-Source Compatibility:
> This is what we aimed for with the GSoC project. Here we want to support
> the execution of a full Hadoop MR Job in Flink without changing the code.
> However, this turned out to be a bit tricky if custom partitioner, sort,
> and grouping-comparators are used in the job. Adding this feature will take
> a bit more time.
> 
> Function- and Job-Level-Source compatibility will enable the use of already
> existing Hadoop code. If you are implementing new analysis jobs anyways,
> I'd go for a Flink implementation which eases many things such as secondary
> sort, unions of multiple inputs, etc.
> 
> Cheers,
> Fabian
> 
> 
> 
> 
> 2014-08-26 11:29 GMT+02:00 Robert Metzger <rm...@apache.org>:
> 
> > Hi Anirvan,
> >
> > I'm forwarding this message to dev@flink.incubator.apache.org. You need to
> > send an (empty) message to dev-subscribe@flink.incubator.apache.org to
> > subscribe to the dev list.
> > The dev@ list is for discussions with the developers, planning etc. The
> > user@flink.incubator.apache.org list is for user questions (for example
> > troubles using the API, conceptual questions etc.)
> > I think the message below is more suited for the dev@ list, since it's
> > basically a feature request.
> >
> > Regarding the names: We don't use Stratosphere anymore. Our codebase has
> > been renamed to Flink and the "org.apache.flink" namespace. So hopefully
> > this naming confusion is finally resolved.
> >
> > For those who want to have a look into the history of the message, see the
> > Google Groups archive here:
> > https://groups.google.com/forum/#!topic/stratosphere-dev/qYvJRSoMYWQ
> >
> > ---------- Forwarded message ----------
> > From: Nirvanesque <ni...@gmail.com>
> > Date: Tue, Aug 26, 2014 at 11:12 AM
> > Subject: [stratosphere-dev] Re: Project for GSoC
> > To: stratosphere-dev@googlegroups.com
> > Cc: tsekis79@gmail.com
> >
> >
> > Hello Artem and mentors,
> >
> > First of all, nice greetings from INRIA, France.
> > Hope you had an enjoyable experience in GSOC!
> > Thanks to Robert (rmetzger) for forwarding me here ...
> >
> > At INRIA, we are starting to adopt Stratosphere / Flink.
> > The top-level goal is to enhance performance for User Defined Functions
> > (UDFs) in long workflows using multiple M-R steps, by using the larger set
> > of Second Order Functions (SOFs) in Stratosphere / Flink.
> > We will demonstrate this improvement by implementing some Use Cases for
> > business purposes.
> > For this purpose, we have chosen some customer analysis Use Cases using
> > weblogs and related data, for two companies (who appeared interested in
> > trying out Stratosphere / Flink):
> > - a mobile phone app developer: http://www.tribeflame.com
> > - an anti-virus & Internet security software company: www.f-secure.com
> > I will be happy to share with you these Use Cases, if you are interested.
> > Just ask me here.
> >
> > At present, we are typically in the profiles of Alice-Bob-Sam, as described
> > in your GSoC proposal
> > <
> > https://github.com/stratosphere/stratosphere/wiki/GSoC-2014-Project-Proposal-Draft-by-Artem-Tsikiridis
> > >.
> > :-)
> > Hadoop seems to be the starting square for the Stratosphere / Flink
> > journey.
> > The situation is the same with developers in the above two companies :-)
> >
> > Briefly,
> > We have installed and run some example programs from Flink / Stratosphere
> > (versions 0.5.2 and 0.6). We use a cluster (Grid'5000) for our Hadoop &
> > Stratosphere installations.
> > We have a good understanding of Hadoop and its use in Streaming and
> > Pipes in conjunction with scripting languages (Python & R specifically).
> > In the first phase, we would like to run some "Hadoop-like" jobs (mainly
> > multiple M-R workflows) on Stratosphere, preferably without extensive Java
> > or Scala programming.
> > I refer to your GSoC project map
> > <
> > https://github.com/stratosphere/stratosphere/wiki/%5BGSoC-14%5D-A-Hadoop-abstraction-layer-for-Stratosphere-%28Project-Map-and-Notes%29
> > >
> > which seems very interesting.
> > If we could have a Hadoop abstraction as you have mentioned, that would be
> > ideal for our first phase.
> > In later phases, when we implement complex join and group operations, we
> > would dive deeper into the Stratosphere / Flink Java or Scala APIs.
> >
> > Hence, I would like to know: what is the current status in this direction?
> > What has been implemented already? From which version onwards? How can we
> > try it?
> > What is yet to be implemented? When, and in which versions?
> >
> > You may also like to see my discussion with Robert on this page
> > <
> > http://flink.incubator.apache.org/docs/0.6-incubating/cli.html#comment-1558297261
> > >.
> >
> > I am still mining different discussions - here as well as on JIRA.
> > Please do refer me to the relevant links, JIRA tickets, etc., if that saves
> > your time in re-typing large replies.
> > It will also help us to understand the train of collective thinking in the
> > Stratosphere / Flink roadmap.
> >
> > Thanks in advance,
> > Anirvan
> > PS: Apologies for switching between names (e.g. Flink / Stratosphere), as
> > I am not sure which name exactly to use currently.
> >
> >
> > On Tuesday, February 25, 2014 10:23:09 PM UTC+1, Artem Tsikiridis wrote:
> > >
> > > Hello Fabian,
> > >
> > > On Tuesday, February 25, 2014 11:20:10 AM UTC+2, fhu...@gmail.com wrote:
> > > > Hi Artem,
> > > >
> > > > thanks a lot for your interest in Stratosphere and participating in
> > > > our GSoC projects!
> > > >
> > > > As you know, Hadoop is the big elephant out there in the Big Data
> > > > jungle and widely adopted. Therefore, a Hadoop compatibility layer is
> > > > a very important feature for any large scale data processing system.
> > > > Stratosphere builds on foundations of MapReduce but generalizes its
> > > > concepts and provides a more efficient runtime.
> > >
> > > Great!
> > >
> > > > When you have a look at the Stratosphere WordCount example program,
> > > > you will see that the programming principles of Stratosphere and
> > > > Hadoop MapReduce are quite similar, although Stratosphere is not
> > > > compatible with the Hadoop interfaces.
> > >
> > > Yes, I've looked into the examples (WordCount, k-means). I also ran the
> > > big test job you have locally and it seems to be OK.
> > >
> > > > With the proposed project, we want to achieve that Hadoop MapReduce
> > > > jobs can be executed on Stratosphere without changing a line of code
> > > > (if possible).
> > > >
> > > > We have already some pieces for that in place. InputFormats are done
> > > > (see https://github.com/stratosphere/stratosphere/tree/master/stratosphere-addons/hadoop-compatibility),
> > > > OutputFormats are work in progress. The biggest missing piece is
> > > > executing Hadoop Map and Reduce tasks in Stratosphere. Hadoop provides
> > > > quite a few interfaces (e.g., overwriting partitioning function and
> > > > sorting comparators, counters, distributed cache, ...). It would of
> > > > course be desirable to support as many of these interfaces as
> > > > possible, but they can be added step-by-step once the first Hadoop
> > > > jobs are running on Stratosphere.
> > >
> > > So if I understand correctly, the idea is to create logical wrappers
> > > for all interfaces used by Hadoop jobs (the way it has been done with
> > > the Hadoop datatypes) so a job can run as transparently as possible on
> > > Stratosphere in an efficient way. I agree, there are many interfaces,
> > > but it's very interesting considering the way Stratosphere defines
> > > tasks, which is a bit different (though, as you said, the principle is
> > > similar).
> > >
> > > I assume the focus is on the YARN version of Hadoop (new API)?
> > >
> > > And one last question: serialization for Stratosphere is Java's default
> > > mechanism, right?
> > >
> > > >
> > > > Regarding your question about cloud deployment scripts, one of our
> > > > team members is currently working on this (see this thread:
> > > > https://groups.google.com/forum/#!topic/stratosphere-dev/QZPYu9fpjMo).
> > > > I am not sure if this is still in the making or already done. If you
> > > > are interested in this as well, just drop a line to the thread.
> > > > Although I am not very familiar with the details of this, my gut
> > > > feeling is that this would be a bit too little for an individual
> > > > project. However, there might be ways to extend this. So if you have
> > > > any ideas, share them with us and we will be happy to discuss them.
> > >
> > > Thank you for bringing up the topic. I will let you know if I come up
> > > with anything for this, probably after I try deploying it on OpenStack.
> > >
> > > >
> > > > Again, thanks a lot for your interest and please don't hesitate to
> > > > ask questions. :-)
> > >
> > > Thank you for the helpful answers.
> > >
> > > Kind regards,
> > > Artem
> > >
> > >
> > > >
> > > > Best,
> > > > Fabian
> > > >
> > > >
> > > > On Tuesday, February 25, 2014 9:12:10 AM UTC+1, tsek...@gmail.com
> > > > wrote:
> > > > Dear Stratosphere devs and fellow GSoC potential students,
> > > > Hello!
> > > > I'm Artem, an undergraduate student from Athens, Greece. You can find
> > > > me on GitHub (https://github.com/atsikiridis) and occasionally on
> > > > Stack Overflow (http://stackoverflow.com/users/2568511/artem-tsikiridis).
> > > > Currently, however, I'm in Switzerland where I am doing my internship
> > > > at CERN as a back-end software developer for INSPIRE, a library for
> > > > High Energy Physics (we're running on http://inspirehep.net/). The
> > > > service is in Python (based on the open-source project
> > > > http://invenio.net) and my responsibilities are mostly the integration
> > > > with Redis, database abstractions, testing (unit, regression) and
> > > > helping our team to integrate modern technologies and frameworks into
> > > > the current code base.
> > > > Moreover, I am very interested in big data technologies; therefore,
> > > > before coming to CERN I had been trying to make my first steps in
> > > > research at the Big Data lab of AUEB, my home university. The main
> > > > objective of the project I had been involved with was the
> > > > implementation of a dynamic caching mechanism for Hadoop (in a way,
> > > > trying our cache instead of the built-in distributed cache). Other
> > > > techs involved were Redis, Memcached, Ehcache (Terracotta). With this
> > > > project we gained some insights into the internals of Hadoop (new API,
> > > > old API, how tasks work, Hadoop serialization, the daemons running,
> > > > etc.) and HDFS, and deployed clusters on cloud computing platforms
> > > > (OpenStack with Nova, Amazon EC2 with boto). We also used the Java
> > > > Remote API for some tests.
> > > > Unfortunately, I have not used Stratosphere before in a research/prod
> > > > environment. I have only played with the examples on my local machine.
> > > > It is very interesting and I would love to learn more.
> > > > There will probably be a learning curve for me on the Stratosphere
> > > > side, but implementing a Hadoop Compatibility Layer seems like a very
> > > > interesting project and I believe I can be of use :)
> > > > Finally, I was wondering whether there are some command-line tools for
> > > > deploying Stratosphere automatically on EC2 or OpenStack clouds (for
> > > > example, Stratosphere-specific abstractions on top of the Python boto
> > > > API). Do you think that would make sense as a project?
> > > > Pardon me for the length of this.
> > > > Kind regards,
> > > > Artem
> >
> >  --
> > You received this message because you are subscribed to the Google Groups
> > "stratosphere-dev" group.
> > To unsubscribe from this group and stop receiving emails from it, send an
> > email to stratosphere-dev+unsubscribe@googlegroups.com.
> > Visit this group at http://groups.google.com/group/stratosphere-dev.
> > For more options, visit https://groups.google.com/d/optout.
> >
> 
