Posted to dev@pig.apache.org by Keren Ouaknine <ke...@gmail.com> on 2015/07/14 21:44:47 UTC

PigMix extension

Hi,

I am working on expanding the PigMix benchmark.
I am interested in adding queries that match more realistic use cases, such as
finding the highest revenue of a page or the burst of activity for a
specific page. Additionally, I would like to add OLTP-like queries, such as
finding other users from the same neighborhood who are looking at a specific
page.

The current PigMix table does not have an id for a page access (see details
on page_views here <https://cwiki.apache.org/confluence/display/PIG/PigMix>).
Therefore I cannot run the above queries.

I am wondering why this field was omitted from the schema of page_views.
It seems to be a fundamental field for all aggregation queries on page_views.

I see two options: either there is another use case that this schema
targets (what is it?) or the benchmark's goal is not to target real use
cases and is merely oriented towards a synthetic performance and
measurement goal.

Any ideas?

Thank you,
Keren

PS: I sent this email to both the dev and user mailing lists, not to
spam us :) but because these queries are both a user and a development
concern.


-- 
Keren Ouaknine
www.kereno.com

Re: PigMix extension

Posted by Alan Gates <al...@gmail.com>.
The initial goal of PigMix was definitely to give the project a way to
measure itself against MapReduce and across different releases.
So that falls into your synthetic category.

That said, if adding a field enables extending the benchmark into new
territory and makes it more useful, then that seems like a clear win.

Alan.

> Keren Ouaknine <ma...@gmail.com>
> July 14, 2015 at 12:44
> Hi,
>
> I am working on expanding the PigMix benchmark.
> I am interested to add queries matching more realistic use cases, such as
> finding what are the highest revenue of a page or what is the burst of
> activity for a specific page. Additionally, I would like to add OLTP-like
> queries such as finding other users from the same neighborhood looking 
> at a
> specific page.
>
> The current PigMix table does not have an id for a page access (see 
> details
> on page_views here 
> <https://cwiki.apache.org/confluence/display/PIG/PigMix>).
> Therefore I cannot run the above queries.
>
> I am wondering why was this field omitted from the schema of page_views?
> It seems a fundamental field for all aggregation queries on page_views.
>
> I see two options: either there is another use case that this schema
> targets (what is it?) or the benchmark's goal is not to target real use
> cases and is merely oriented towards a synthetic performance and
> measurement goal.
>
> Any ideas?
>
> Thank you,
> Keren
>
> PS: I sent this email to both the devs and users' mailing list, not to
> spam us :) but because these queries are both a users and a development
> concern.
>
>
