Posted to issues@tajo.apache.org by "Jihoon Son (JIRA)" <ji...@apache.org> on 2014/12/30 10:07:13 UTC
[jira] [Comment Edited] (TAJO-982) Improve Fetcher to get multiple shuffle outputs through a request

    [ https://issues.apache.org/jira/browse/TAJO-982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14260913#comment-14260913 ] 

Jihoon Son edited comment on TAJO-982 at 12/30/14 9:07 AM:
-----------------------------------------------------------

Hi guys, I have two ideas for this issue.
* When writing intermediate data for shuffle, we can merge small files into larger ones. I think this is not feasible, because the merging would have to take task assignment into account, which forces static task assignment.
* As described in this issue, we can improve Fetchers to get multiple files via a single request. This approach raises another question about the transmission protocol, for which I'm considering two options:
** Keep using HTTP as in the current implementation, but improve the Fetchers and PullServers to handle an HTTP request for multiple files. For example, a Fetcher can request a virtual HTTP address that identifies multiple files. A PullServer that receives the request can extract the real file names from the virtual address, dynamically merge those files into a single file, and send it.
** Use an alternative transmission protocol that natively supports transferring multiple files in a single request.
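
To make the HTTP option more concrete, here is a rough sketch of the idea, not Tajo's actual Fetcher or PullServer code. The class name, the `files=` query parameter, and the comma-separated encoding are all hypothetical, chosen only for illustration. It shows a Fetcher packing several file names into one virtual address, and a PullServer unpacking them and streaming the files back as one merged body.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class MultiFetchSketch {
    // Hypothetical query parameter; not part of any real Tajo protocol.
    static final String PARAM = "files=";

    /** Fetcher side: build one virtual address that names several shuffle outputs. */
    public static String buildVirtualUri(String base, List<String> fileNames) throws IOException {
        List<String> encoded = new ArrayList<>();
        for (String name : fileNames) {
            // URL-encoding escapes ',' inside names, so ',' is safe as a separator.
            encoded.add(URLEncoder.encode(name, "UTF-8"));
        }
        return base + "?" + PARAM + String.join(",", encoded);
    }

    /** PullServer side: recover the real file names from the virtual address. */
    public static List<String> parseVirtualUri(String uri) throws IOException {
        int idx = uri.indexOf("?" + PARAM);
        if (idx < 0) {
            throw new IllegalArgumentException("no file list in " + uri);
        }
        String list = uri.substring(idx + 1 + PARAM.length());
        List<String> names = new ArrayList<>();
        for (String part : list.split(",")) {
            if (!part.isEmpty()) {
                names.add(URLDecoder.decode(part, "UTF-8"));
            }
        }
        return names;
    }

    /** PullServer side: write the requested files back to back into one response body. */
    public static void sendMerged(Path dir, List<String> names, OutputStream out) throws IOException {
        Path root = dir.toAbsolutePath().normalize();
        for (String name : names) {
            Path file = root.resolve(name).normalize();
            // Reject names that would escape the shuffle directory.
            if (!file.startsWith(root)) {
                throw new IOException("path escapes shuffle dir: " + name);
            }
            Files.copy(file, out);
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("shuffle");
        Files.write(dir.resolve("part-0"), "aaa".getBytes(StandardCharsets.UTF_8));
        Files.write(dir.resolve("part-1"), "bbb".getBytes(StandardCharsets.UTF_8));

        // Placeholder host and port, for illustration only.
        String uri = buildVirtualUri("http://pullserver.example:8080/",
                Arrays.asList("part-0", "part-1"));
        ByteArrayOutputStream body = new ByteArrayOutputStream();
        sendMerged(dir, parseVirtualUri(uri), body);
        System.out.println(uri + " -> " + body.size() + " bytes");
    }
}
```

Note that this sketch simply concatenates the files. If the Fetcher needs to tell the individual outputs apart, the response would also have to carry their lengths or some other delimiter, which is not shown here.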

I think the last one is the best approach, but I don't have much background in that area yet.
What do you think of these approaches?



> Improve Fetcher to get multiple shuffle outputs through a request
> -----------------------------------------------------------------
>
>                 Key: TAJO-982
>                 URL: https://issues.apache.org/jira/browse/TAJO-982
>             Project: Tajo
>          Issue Type: Improvement
>          Components: data shuffle
>            Reporter: Hyunsik Choi
>            Assignee: Jihoon Son
>             Fix For: 0.10
>
>
> Currently, a Fetcher can request only one shuffle output at a time. This can cause performance degradation even when the intermediate data is actually small.
> For example, if the input data set of the first stage is big and the intermediate data is very small, QueryMaster will choose only a few nodes for the next execution block. Since each node keeps a limited number of fetch threads, it will take a long time for the next stage's nodes to fetch all the intermediate data.
> If a Fetcher could get multiple shuffle outputs through a single request, it would resolve the slowness that occurs in such cases.


