Posted to dev@community.apache.org by "Bertty Contreras (Jira)" <ji...@apache.org> on 2022/04/19 10:13:00 UTC

[jira] [Updated] (COMDEV-475) Apache Wayang(Incubating): Efficiently Dealing with Iterative Jobs

     [ https://issues.apache.org/jira/browse/COMDEV-475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bertty Contreras updated COMDEV-475:
------------------------------------
    Description: 
*Synopsis*

Apache Wayang (Incubating) currently uses a cost model to choose the right platforms and optimize execution plans. Its query optimizer already supports the loop operations used in machine learning workloads. Nevertheless, the current loop optimization is primitive and lacks several features required to obtain correct cost predictions, e.g., handling changes in the cardinality of the data inside a loop, loop expansion, and loops with dynamic conditions.
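
To illustrate why cardinality changes and dynamic conditions matter for the prediction, here is a minimal sketch in Python. It is not Apache Wayang (Incubating) code: the function names, the per-tuple cost, the selectivity parameter, and the stop predicate are all hypothetical and chosen only for illustration. It compares an estimate that assumes a fixed cardinality in every iteration with one that re-derives the cardinality per iteration and honors a data-dependent loop condition.

```python
# Hypothetical sketch: none of these names come from the Apache Wayang code base.
from typing import Callable


def fixed_cardinality_cost(input_card: float, iterations: int,
                           cost_per_tuple: float) -> float:
    """Naive estimate: every iteration processes the initial cardinality."""
    return iterations * input_card * cost_per_tuple


def dynamic_cardinality_cost(input_card: float, max_iterations: int,
                             cost_per_tuple: float, selectivity: float,
                             stop: Callable[[float], bool]) -> float:
    """Estimate that updates the cardinality in each iteration.

    selectivity: assumed ratio of output to input tuples of the loop body.
    stop: assumed loop condition, evaluated on the current cardinality.
    """
    total = 0.0
    card = input_card
    for _ in range(max_iterations):
        if stop(card):              # dynamic condition may end the loop early
            break
        total += card * cost_per_tuple
        card *= selectivity         # cardinality changes inside the loop
    return total


if __name__ == "__main__":
    naive = fixed_cardinality_cost(1000, 3, 1.0)
    shrinking = dynamic_cardinality_cost(1000, 3, 1.0, 0.5, lambda c: False)
    early_exit = dynamic_cardinality_cost(1000, 3, 1.0, 0.5, lambda c: c < 300)
    print(naive, shrinking, early_exit)
```

With these assumed numbers the naive estimate is noticeably higher than either dynamic estimate, which is the kind of gap that could lead an optimizer to pick a suboptimal platform for the loop.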

 

*Benefits to Community*

The community will gain a better approach to dealing with iterative jobs in Apache Wayang (Incubating). In particular, it will benefit from a more efficient way to run ML workloads over big data.

 

*Deliverables*

The expected deliverable is a robust solution for optimizing loops inside the query optimizer of Apache Wayang (Incubating).

 

The expected steps are the following:
 * Understand the paper [1]
 * Become familiar with the internal Apache Wayang (Incubating) cost model
 * Discuss and design the solution for the loop optimizations
 * Implement and integrate the loop optimization into the current query optimizer

 

*Related Work*

[1] [RHEEMix in the data jungle: a cost-based optimizer for cross-platform systems|https://wayang.apache.org/assets/pdf/paper/journal_vldb.pdf]

 

*Biographical Information of possible Mentors*

 

Bertty Contreras-Rojas is a Senior Software Engineer at Databloom Inc. He is a member of the PPMC of Apache Wayang (Incubating). He has many years of experience developing data-intensive processing systems for several industries, such as banking. He was a research engineer at the Qatar Computing Research Institute, where he was responsible for developing the declarative query engine for Rheem and adding new underlying platforms to Rheem.

 

Rodrigo Pardo-Meza is a Senior Software Engineer at Databloom Inc. He is a member of the PPMC of Apache Wayang (Incubating). He has many years of experience developing applications that support Big Data processing, including implementing ETL processes over distributed systems to optimize inventories in supply chains. He was a research engineer at the Qatar Computing Research Institute, where he specialized in human interface interaction with big data analytics. During this time, he co-developed an ML-based cross-platform query optimizer.

 

Jorge Quiané is the head of the Big Data Systems research group at the Berlin Institute for the Foundations of Learning and Data (BIFOLD) and a Principal Researcher at DIMA (TU Berlin). He also acts as the Scientific Coordinator of the IAM group at the German Research Center for Artificial Intelligence (DFKI). His current research is in the broad area of big data: mainly in federated data analytics, scalable data infrastructures, and distributed query processing. He has published numerous research papers on data management and novel system architectures. He has recently been honoured with the 2022 ACM SIGMOD Research Highlight Award and the Best Paper Award at ICDE 2021 for his work on “Efficient Control Flow in Dataflow Systems”. He holds five patents in core database areas and on machine learning. Earlier in his career, he was a Senior Scientist at the Qatar Computing Research Institute (QCRI) and a Postdoctoral Researcher at Saarland University. He obtained his PhD in computer science from INRIA (Nantes University).

 

*Name and Contact Information*

Name: Bertty Contreras-Rojas

email: bertty (at) apache.org

community: dev (at) wayang.apache.org

website: [https://wayang.apache.org|https://wayang.apache.org/]



> Apache Wayang(Incubating): Efficiently Dealing with Iterative Jobs
> ------------------------------------------------------------------
>
>                 Key: COMDEV-475
>                 URL: https://issues.apache.org/jira/browse/COMDEV-475
>             Project: Community Development
>          Issue Type: New Feature
>          Components: GSoC/Mentoring ideas
>            Reporter: Bertty Contreras
>            Priority: Critical
>              Labels: gsoc, gsoc2022, machine_learning
>   Original Estimate: 350h
>  Remaining Estimate: 350h
>


