Posted to dev@wayang.apache.org by Kaustubh Beedkar <kb...@gmail.com> on 2023/06/05 05:00:33 UTC

Re: Profiler and Cost Functions

Dear Josep,

Let me try to answer these. Please see my responses inline below.


On Wed, May 31, 2023 at 4:37 PM Josep Sampe Domenech
<Jo...@ibm.com.invalid> wrote:

> Thanks Jorge, this helps a lot to clarify the points related to the
> Genetic Optimizer.
>
>
> I have a few additional questions on the subject matter:
>
>
>   1.  Regarding the platforms: Do you consider adding support for using
> multiple Spark or Postgres instances simultaneously? I noticed there is a
> branch on GitHub dedicated to this purpose, specifically implemented for
> Spark. I'm curious to know if this is just a proof of concept or if it's
> something you plan to incorporate in the future.
>
In theory, Wayang can support multiple instances of the same platform.
However, this would require a unique identifier for each platform instance
and the corresponding follow-up changes. This is very much in our plans for
the near future.


>
>   2.  Regarding the operators: In the Postgres platform, I can see the
> Executor, Filter, Projection, and TableSource operators. Currently, when I
> read two tables from Postgres and perform a JOIN operation, it appears that
> the JOIN is executed locally within the Wayang environment using the Java
> streams platform, rather than running the JOIN operation directly within
> Postgres itself. Is it because the Join operator in Postgres has not been
> implemented yet? Or is it because, based on the cost functions, it is
> considered more cost-effective to execute the JOIN locally? Or am I missing
> something?
>
In this case, the join operator is not yet implemented. We are in the
process of supporting join pushdowns as part of the Wayang SQL API.

>
>
>
>   3.  Regarding the cost functions: To clarify some things related to
> Section 4 of the paper: do you consider the cost of moving data between
> platforms by default? Is the cost of moving data between platforms taken
> into account in the conversion operators, like the SqlToStreamOperator? If
> so, should I add a custom cost-function template in the “network” key of
> the wayang.postgres.sqltostream.load.output.template to take this data
> movement into account? Or is the data transfer cost between platforms
> considered in a different place, so that I should do it in a different way?
>
I am not 100% sure about this, but
https://github.com/apache/incubator-wayang/blob/80170b543469172438bb603dd6b5fbb2bd5dae64/wayang-commons/wayang-core/src/main/java/org/apache/wayang/core/optimizer/channels/DefaultChannelConversion.java#L181
could be a useful pointer.

Best,
Kaustubh




>
> Thanks & best regards,
> Josep
>
>
> From: Jorge Arnulfo Quiané Ruiz <qu...@gmail.com>
> Date: Friday, 26 May 2023 at 11:55
> To: dev@wayang.apache.org <de...@wayang.apache.org>
> Subject: [EXTERNAL] Re: Profiler and Cost Functions
> Hello Josep,
>
> Replying with a bit of delay because I have been travelling this week :)
>
> Regarding your second point, we basically have two ways of learning the
> cost parameters of the execution operators: by analysing execution logs
> (using the genetic optimizer) or by profiling individual operators. The
> package you refer to is for the latter (profiling individual execution
> operators). This was our original idea for obtaining the cost parameters,
> but we quickly found out that the results would be far off from the real
> costs, because most big data platforms exploit operator pipelining, which
> makes it hard to profile operators individually. So, you cannot use the
> output of this individual profiler for the genetic algorithm.
>
> Now to your first point, regarding the Genetic Optimizer. This was our
> solution to the problem with the individual operator profiling approach.
> The genetic optimizer instead tries to derive the operator costs by
> analysing execution logs. For this, it requires both a cost function
> template per execution operator (which should be specified in JSON format:
> https://github.com/apache/incubator-wayang/blob/80170b543469172438bb603dd6b5fbb2bd5dae64/wayang-platforms/wayang-spark/code/main/resources/wayang-spark-defaults.properties
> ) and Wayang execution logs (i.e. from running jobs via Wayang). The genetic
> optimizer will learn the coefficients (denoted by ? in the template
> function). To understand how it does so, our VLDBJ paper (also on
> arXiv) gives a bit more detail and a pointer to the genetic
> optimization we use:
> https://arxiv.org/pdf/1805.03533.pdf (Section 3.2 and Figure 4).
>
> Let us know if that helps.
>
> Best,
> Jorge
>
> > On 24 May 2023, at 11.12, Josep Sampe Domenech <
> Josep.Sampe.Domenech@ibm.com.INVALID> wrote:
> >
> > Hello dev,
> >
> >
> >
> > We recently started our exploration of the Wayang project and we would
> like to gain a deeper understanding of the profiler tool and its
> functionalities, specifically about the collection and use of metrics.
> >
> >
> >
> > To enhance our comprehension, we would appreciate your assistance in
> addressing the following queries:
> >
> >
> >
> >  1.  Could you please provide us with an explanation of how the
> GeneticOptimizerApp works? Specifically, we would like to understand which
> information from the executions.json file is taken into consideration when
> calculating the "?" parameters in the cost functions. Additionally, we are
> interested in learning more about the methodology employed to calculate the
> "?" values.
> >
> >
> >
> >  2.  We are also curious about the purpose of the profiler.spark
> package. Does it serve a specific objective, and can the results obtained
> from profiler.spark be utilized or integrated into the
> GeneticOptimizerApp?
> >
> >
> >
> >
> >
> > Thank you in advance for your time and attention. We look forward to
> your response.
> >
> >
> >
> > Best regards,
> >
> > Josep
> >
>
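
Editorial sketch (not part of the original messages): the idea Jorge
describes above, learning the "?" coefficients of cost function templates
from whole-job execution logs, can be illustrated with a toy genetic
algorithm. This is only a minimal sketch under stated assumptions and not
Wayang's GeneticOptimizerApp: the operator names, the linear template
"?*in0 + ?", the log layout (a list of operators with input cardinalities
plus one measured runtime per job), and the synthetic data are all
hypothetical. It does reflect the point about pipelining: only the total
job runtime is observed, never per-operator timings.

```python
import random

# Hypothetical operator names; each one gets a linear template "?*in0 + ?".
OPERATORS = ["map", "filter"]


def estimate(coeffs, job):
    """Estimated job runtime: sum of a*cardinality + b over its operators."""
    return sum(coeffs[op][0] * card + coeffs[op][1] for op, card in job["operators"])


def loss(coeffs, logs):
    """Mean relative error between estimated and measured job runtimes."""
    return sum(abs(estimate(coeffs, j) - j["millis"]) / j["millis"] for j in logs) / len(logs)


def random_individual(rng):
    """One candidate assignment of the "?" coefficients."""
    return {op: (rng.uniform(0, 0.01), rng.uniform(0, 100)) for op in OPERATORS}


def crossover(p1, p2, rng):
    """Take each operator's coefficient pair from one of the two parents."""
    return {op: (p1[op] if rng.random() < 0.5 else p2[op]) for op in OPERATORS}


def mutate(ind, rng, scale=0.2):
    """Perturb every coefficient multiplicatively, keeping it non-negative."""
    return {
        op: (max(0.0, a * (1 + rng.gauss(0, scale))), max(0.0, b * (1 + rng.gauss(0, scale))))
        for op, (a, b) in ind.items()
    }


def fit(logs, generations=100, population_size=60, seed=0):
    """Evolve coefficient assignments; the best quarter survives unchanged."""
    rng = random.Random(seed)
    population = [random_individual(rng) for _ in range(population_size)]
    for _ in range(generations):
        population.sort(key=lambda ind: loss(ind, logs))
        elite = population[: population_size // 4]
        children = [
            mutate(crossover(rng.choice(elite), rng.choice(elite), rng), rng)
            for _ in range(population_size - len(elite))
        ]
        population = elite + children
    return min(population, key=lambda ind: loss(ind, logs))


def make_synthetic_logs(n=40, seed=1):
    """Toy stand-in for execution logs: only whole-job runtimes are known."""
    rng = random.Random(seed)
    true = {"map": (0.002, 50.0), "filter": (0.001, 20.0)}  # made-up values
    logs = []
    for _ in range(n):
        ops = [(op, rng.randint(1_000, 100_000)) for op in OPERATORS]
        logs.append({"operators": ops, "millis": estimate(true, {"operators": ops})})
    return logs


if __name__ == "__main__":
    logs = make_synthetic_logs()
    best = fit(logs)
    print(best, loss(best, logs))
```

Because the best candidates are carried over unchanged in every generation,
the best loss never gets worse from one generation to the next. How close
the learned coefficients come to the real ones depends on how varied the
logged jobs are.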

Re: Profiler and Cost Functions

Posted by Zoi Kaoudi <zk...@yahoo.gr.INVALID>.
The link above doesn't work. Here it is again: https://wayang.apache.org/assets/pdf/paper/icde20.pdf
Best
--
Zoi

    On Wednesday, 14 June 2023 at 09:56:42 AM CEST, Zoi Kaoudi <zk...@yahoo.gr.invalid> wrote:
 
Hi Josep,
To answer your question about the data movement costs: you are right. Data movement is encoded in the conversion operators, so its cost can be defined in the same way as for all other operators.
Just a side note: to make cost model tuning easier for users, we also have another version of the optimizer, which uses ML to predict the cost of execution plans. You can find the details in our 2020 paper: https://wayang.apache.org/assets/pdf/paper/icde20.pdf
It is not yet incorporated in Wayang, but it is among our next steps.
Best
--
Zoi
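
Editorial sketch (not part of the original messages): to illustrate the
remark above that a conversion operator is costed like any other operator,
the toy below fills the "?" placeholders of a template with coefficients
and sums the resulting costs over a plan that includes a conversion step.
Everything here is an assumption made for illustration: the template syntax
"?*in0 + ?", the operator labels, the coefficients, and the cardinalities
are made up, and none of this code is Wayang's own.

```python
def fill_template(template, coefficients):
    """Replace each '?' in the template, left to right, with a coefficient."""
    parts = template.split("?")
    if len(parts) - 1 != len(coefficients):
        raise ValueError("coefficient count does not match the number of '?'")
    filled = parts[0]
    for coeff, rest in zip(coefficients, parts[1:]):
        filled += repr(coeff) + rest
    return filled


def evaluate(expression, in0):
    """Evaluate a filled-in arithmetic expression for input cardinality in0.

    eval() without builtins is good enough for a toy; a real system would
    use a proper expression parser instead.
    """
    return eval(expression, {"__builtins__": {}}, {"in0": in0})


def plan_cost(plan):
    """Sum the costs of all operators; conversions get no special treatment."""
    return sum(
        evaluate(fill_template(template, coeffs), cardinality)
        for _label, template, coeffs, cardinality in plan
    )


# Hypothetical plan: (label, template, coefficients, input cardinality).
PLAN = [
    ("postgres table source", "?*in0 + ?", [0.5, 100], 1000),
    ("sqltostream (conversion)", "?*in0 + ?", [2, 50], 1000),
    ("java join", "?*in0 + ?", [1, 10], 1000),
]

if __name__ == "__main__":
    for label, template, coeffs, cardinality in PLAN:
        print(label, evaluate(fill_template(template, coeffs), cardinality))
    print("total", plan_cost(PLAN))
```

In this reading, accounting for data movement means giving the conversion
operator's template suitable terms and coefficients, not adding a separate
cost mechanism.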

    
