Posted to user@beam.apache.org by Stephan Buys <be...@stephanbuys.com> on 2016/05/25 11:43:05 UTC

Question from a data analytics/log management dude

Hi all,

Hope I'm in the right forum. I'm someone with about a decade's worth of log management/event analytics experience; for the last two years, though, we've been building our own solutions based on a variety of open source technologies. As hopefully some of you will appreciate, whenever you want to do something interesting, or at scale, with time-series/event data, a lot of the tools are lacking.

I started off working in Splunk, and it sort of spoiled me with end-user/administrator functionality from the get-go (even if it is prohibitively expensive and slow). In Splunk, the 'sandpit' that you play in has all the toys a non-developer could ask for: built-in map/reduce + streaming, and manipulation of results/streams through a simple DSL familiar to anyone with a bit of Unix CLI/Bash experience. (i.e. search something | filter | map | eval | visualise http://docs.splunk.com/Documentation/Splunk/latest/Search/Aboutsearchlanguagesyntax)

At the moment we spend our days in logstash + elasticsearch (and sundry). 

I looked into Beam and Flink a bit, and from a technical perspective it seems like the ideal direction to go, combining many sources of data (such as Elasticsearch, InfluxDB, RethinkDB, etc.) and many analytics use-cases. The only gotcha seems to be that, from what I can see, the target audience is almost always developers. This isn't a problem for me, but ideally I would want to bolt a simple DSL (submittable via simple interfaces, such as a CLI) on top of my datasets while keeping all of the stream/batch processing capabilities that projects like Flink allow.
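For a sense of what such a pipe-style DSL could look like, here is a toy sketch layered over plain Python generators. Everything in it (Pipe, where, eval_field, the sample events) is a hypothetical illustration, not part of Beam, Splunk, or any real project:

```python
# Toy sketch of a Splunk-style pipe DSL over plain Python generators.
# All names here (Pipe, where, eval_field) are hypothetical illustrations.

class Pipe:
    """Wraps a transform so stages can be chained with the | operator."""
    def __init__(self, fn):
        self.fn = fn

    def __ror__(self, stream):  # invoked for: stream | Pipe(...)
        return self.fn(stream)

def where(pred):
    """Keep only events matching the predicate (Splunk-style filter)."""
    return Pipe(lambda stream: (e for e in stream if pred(e)))

def eval_field(name, fn):
    """Add a computed field to each event (Splunk-style eval)."""
    def apply(stream):
        for e in stream:
            yield {**e, name: fn(e)}
    return Pipe(apply)

events = [
    {"host": "web1", "bytes": 512},
    {"host": "db1",  "bytes": 2048},
    {"host": "web2", "bytes": 4096},
]

# search events | filter | eval, expressed as a pipe chain:
result = list(
    events
    | where(lambda e: e["host"].startswith("web"))
    | eval_field("kb", lambda e: e["bytes"] // 1024)
)
```

Notably, Beam's Python SDK already uses the same `|` idiom for applying transforms to a pipeline, so a thin end-user DSL in this style is not far-fetched.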

Is anyone aware of projects/efforts along these lines? Any ideas on how we could get there from a project such as Apache Beam? (Am I being naive?)

Your input/perspectives are most welcome!

Kind regards,
Stephan Buys






Another question re dataflow architecture. (was Re: Question from a data analytics/log management dude)

Posted by Stephan Buys <be...@stephanbuys.com>.
What role do native queries from sources play, and to what degree, in the Dataflow architecture? I'm guessing the answer really depends on the situation, but in a micro-batch (or hybrid) situation I'd guess that as much of the heavy lifting as possible will still be relegated to the initial query/source ingest?
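The trade-off being asked about can be sketched in miniature. This toy example (FakeStore and its query() method are hypothetical stand-ins, not a real connector API) contrasts pushing a filter into the source's native query with filtering inside the pipeline:

```python
# Toy sketch contrasting source-side ("native query") filtering with
# in-pipeline filtering. FakeStore is a hypothetical stand-in for a
# store like Elasticsearch that can filter natively.

class FakeStore:
    def __init__(self, docs):
        self.docs = docs

    def query(self, predicate=None):
        # Pushdown: the store applies the predicate before shipping docs,
        # so only matching documents ever leave the store.
        for doc in self.docs:
            if predicate is None or predicate(doc):
                yield doc

store = FakeStore([
    {"level": "ERROR", "msg": "boom"},
    {"level": "INFO",  "msg": "ok"},
    {"level": "ERROR", "msg": "bang"},
])

# Option A: heavy lifting at the source -- only matches enter the pipeline.
pushed_down = list(store.query(lambda d: d["level"] == "ERROR"))

# Option B: pull everything, filter in the pipeline -- every doc is shipped.
in_pipeline = [d for d in store.query() if d["level"] == "ERROR"]
```

Both options yield the same result; the difference is how much data crosses the wire before the pipeline sees it, which is exactly why the initial query/source ingest tends to carry the heavy lifting.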


> On 26 May 2016, at 9:53 AM, Stephan Buys <be...@stephanbuys.com> wrote:
> [earlier messages quoted in full; trimmed]


Re: Question from a data analytics/log management dude

Posted by Stephan Buys <be...@stephanbuys.com>.
Thanks so much for the feedback (Max and JB) so far, as well as the references to the projects; my reading list keeps growing.

Continuing with my bad habit of just asking before I'm really familiar with a subject...

The more I look at the examples and read about the kinds of problems Dataflow/Beam attempt to solve, the more I run into a perceived chasm between stacks such as ELK (Elasticsearch/Logstash, etc.) or Splunk and projects such as Apache Beam. I guess that even though, strictly speaking, the problems being solved are the same, Splunk/ELK/etc. are more suited to querying/searching/investigation, whereas projects such as Beam are well suited to being a pipeline feeding those systems, a pipeline integrating with those systems for realtime metrics/reporting, and a pipeline for alerting/training.

In my mind, a proper streaming system keeps looping back into, and originating from, a data store such as Elasticsearch/HDFS. Am I on the right track? Is there a 'grand unified' vision for these kinds of systems that I can delve into a bit?
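That store -> pipeline -> store loop could be sketched roughly as below. The names (store_read, store_write, enrich) are hypothetical stand-ins; a real pipeline would use proper IO connectors for the read and write ends:

```python
# Toy sketch of the read -> process -> write-back loop described above.
# store_read/store_write are hypothetical stand-ins for real IO connectors.

def store_read(store):
    """Originate a stream of events from a store."""
    yield from store

def store_write(store, events):
    """Land the processed stream back in a (possibly different) store."""
    for e in events:
        store.append(e)

def enrich(events):
    """The processing step: flag server errors for downstream alerting."""
    for e in events:
        yield {**e, "alert": e.get("status", 0) >= 500}

raw = [{"path": "/a", "status": 200}, {"path": "/b", "status": 503}]
derived = []

# One turn of the loop: originate from a store, process, land in a store.
store_write(derived, enrich(store_read(raw)))
```

The "loop" closes when a later pipeline run reads the derived store as its own source.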

Regards,
Stephan

> On 25 May 2016, at 4:14 PM, Jean-Baptiste Onofré <jb...@nanthrax.net> wrote:
> [earlier message quoted in full; trimmed]


Re: Question from a data analytics/log management dude

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
Hi Stephan,

I created Karaf Decanter as an alternative to logstash/elasticsearch.

What you describe looks like a DSL to me, as discussed a bit here:

- Technical Vision
- http://blog.nanthrax.net/2016/01/introducing-apache-dataflow/

I'm working on a PoC to mix Decanter with Beam, which could result in a DSL ;)

Regards
JB

On 05/25/2016 01:43 PM, Stephan Buys wrote:
> [original message quoted in full; trimmed]

-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

Re: Question from a data analytics/log management dude

Posted by Maximilian Michels <mx...@apache.org>.
Hi Stephan,

I can certainly imagine future DSLs on top of Apache Beam. However,
melting all the features of the Beam API into a DSL is not that easy.
You will likely end up with something similarly complex to use as the
existing API :)

There are projects that try to simplify Big Data processing and visualization:

Apache NiFi
https://nifi.apache.org/

Apache Zeppelin (incubating)
https://zeppelin.incubator.apache.org/

I would love to see those integrate with Apache Beam. Both of these
projects have integrated with Apache Flink in the past.

Best,
Max

On Wed, May 25, 2016 at 1:43 PM, Stephan Buys <be...@stephanbuys.com> wrote:
> [original message quoted in full; trimmed]