Posted to user@spark.apache.org by Mat Schaffer <ma...@schaffer.me> on 2016/05/23 05:28:34 UTC

Spark for offline log processing/querying

I'm curious about trying to use Spark as a cheap/slow ELK
(Elasticsearch, Logstash, Kibana) system. I'm thinking of something like
the following (rough code sketch below the list):

- instances rotate local logs
- copy rotated logs to S3
(s3://logs/region/grouping/instance/service/*.logs)
- Spark to convert the raw text logs to Parquet
- maybe Presto to query the Parquet?
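
Concretely, I imagine the Spark-to-Parquet step looking roughly like the
sketch below. It's only a guess on my part: the line format, the column
names, and the s3://logs-parquet/ output bucket are made up, Spark 1.x would
go through sqlContext.read rather than SparkSession, and the s3:// scheme may
need to be s3a:// or s3n:// depending on the Hadoop/S3 setup.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Read the rotated logs using the bucket layout above
# (region/grouping/instance/service/*.logs).
spark = SparkSession.builder.appName("logs-to-parquet").getOrCreate()
raw = spark.read.text("s3://logs/*/*/*/*/*.logs")

# Pull a few fields out of each line; the pattern (timestamp, level, message)
# is a guess, so adjust it to whatever the services actually emit.
pattern = r"^(\S+ \S+) (\S+) (.*)$"
parsed = raw.select(
    F.regexp_extract("value", pattern, 1).alias("ts_raw"),
    F.regexp_extract("value", pattern, 2).alias("level"),
    F.regexp_extract("value", pattern, 3).alias("message"),
).withColumn("ts", F.unix_timestamp("ts_raw", "yyyy-MM-dd HH:mm:ss").cast("timestamp"))

# Write date-partitioned Parquet so later queries (Spark SQL or Presto) can
# prune partitions instead of scanning everything.
(parsed
 .withColumn("dt", F.to_date("ts"))
 .write
 .mode("append")
 .partitionBy("dt")
 .parquet("s3://logs-parquet/"))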

I'm still new to Spark though, so I thought I'd ask whether anyone is
familiar with this sort of thing, and whether there are articles or docs I
should be reading to learn how to build such a thing. Or whether such a
thing even makes sense.

Thanks in advance, and apologies if this has already been asked and I
missed it!

-Mat

matschaffer.com

Re: Spark for offline log processing/querying

Posted by Renato Marroquín Mogrovejo <re...@gmail.com>.
We also did some benchmarking with analytical queries similar to TPC-H,
both on Spark and on Presto, and our conclusion was that Spark is a great
general-purpose solution, but for analytical SQL queries it is not quite
there yet. For 10 or 100 GB of data you will get your results back, but
Presto was much faster and more predictable. Of course, if you are planning
to do machine learning or ad-hoc data processing, then Spark is the right
solution.


Renato M.


Re: Spark for offline log processing/querying

Posted by Mat Schaffer <ma...@schaffer.me>.
It's only mildly interactive. When I used Presto+Hive in the past
(just as a consumer, not an admin) it seemed able to provide answers
within ~2 minutes even for fairly large data sets. I'm hoping I can get a
similar level of responsiveness with Spark.

Thanks, Sonal! I'll take a look at the example log processor and see what I
can come up with.


-Mat

matschaffer.com


Re: Spark for offline log processing/querying

Posted by Jörn Franke <jo...@gmail.com>.
Do you want to replace ELK with Spark? Depending on your queries, you could
do as you proposed. However, many of the text-analytics queries will probably
be much faster on ELK. If your queries are more interactive rather than
batch-oriented, then it does not make much sense. I am not sure why you plan
to use Presto.


Re: Spark for offline log processing/querying

Posted by Sonal Goyal <so...@gmail.com>.
Hi Mat,

I think you could also use Spark SQL to query the logs. Hope the following
link helps:

https://databricks.com/blog/2014/09/23/databricks-reference-applications.html
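
For example, once the logs are in Parquet you could register them as a table
and query with plain SQL. This is just a minimal sketch: the s3://logs-parquet/
path and the dt/level columns come from the hypothetical layout sketched in
the original post, not from the reference applications, and on Spark 1.x you
would use registerTempTable and sqlContext.sql instead.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("query-logs").getOrCreate()

# Point Spark SQL at the Parquet output and expose it as a table.
logs = spark.read.parquet("s3://logs-parquet/")
logs.createOrReplaceTempView("logs")

# Example: error counts per day and per level for one week.
spark.sql("""
    SELECT dt, level, count(*) AS n
    FROM logs
    WHERE dt BETWEEN '2016-05-16' AND '2016-05-23'
    GROUP BY dt, level
    ORDER BY dt, level
""").show()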