Posted to common-issues@hadoop.apache.org by "Ayush Saxena (Jira)" <ji...@apache.org> on 2021/03/27 08:49:00 UTC
[jira] [Commented] (HADOOP-17558) DistCp: Reduce memory usage using a fixed size ThreadPoolExecutor

    [ https://issues.apache.org/jira/browse/HADOOP-17558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17309902#comment-17309902 ] 

Ayush Saxena commented on HADOOP-17558:
---------------------------------------

Attached a draft patch, just to highlight the direction being pursued and to give a basic code-level idea. (I do not plan to remove the present code, break compatibility, or do anything that drastic.)

The key points are:
 * Have a fixed queue size. In trunk the queue keeps expanding, because the producer is multi-threaded and the consumer is single-threaded; a fixed queue size prevents the queue from becoming oversized.
 * Use CallerRunsPolicy for automatic throttling.
 * The threads not only list but also consume the files: if an entry is a directory, we store it for the response; if it is a file, we process it and get rid of the burden of its FileStatus. Unlike the present producer-consumer model, consumption of files is now at least multi-threaded.
 * Use listStatusIterator instead of listStatus. Since we are also consuming, this should further help reduce the memory pressure.
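
To illustrate the first two points, here is a minimal standalone sketch (not taken from the draft patch) of a fixed-size ThreadPoolExecutor with a bounded queue and the JDK's ThreadPoolExecutor.CallerRunsPolicy. The class name BoundedListingPool and the helpers newPool and runTasks are hypothetical, and the tasks are simple counters standing in for listing work.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration only; names are not from the DistCp patch.
class BoundedListingPool {

  /** Builds a fixed-size pool whose work queue never holds more than queueSize tasks. */
  static ThreadPoolExecutor newPool(int threads, int queueSize) {
    return new ThreadPoolExecutor(
        threads, threads,                            // core == max: fixed pool size
        0L, TimeUnit.MILLISECONDS,
        new ArrayBlockingQueue<>(queueSize),         // bounded queue
        new ThreadPoolExecutor.CallerRunsPolicy());  // full queue: submitter runs the task
  }

  /** Submits the given number of dummy tasks and returns how many actually ran. */
  static int runTasks(int threads, int queueSize, int tasks) {
    ThreadPoolExecutor pool = newPool(threads, queueSize);
    AtomicInteger processed = new AtomicInteger();
    for (int i = 0; i < tasks; i++) {
      // When the queue is full, this call runs the task in the submitting
      // thread, which slows the producer down instead of growing the queue.
      pool.execute(processed::incrementAndGet);
    }
    pool.shutdown();
    try {
      if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
        throw new IllegalStateException("pool did not terminate in time");
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return processed.get();
  }

  public static void main(String[] args) {
    // No task is dropped: everything either queues or runs in the caller.
    System.out.println(runTasks(2, 4, 100));
  }
}
```

The throttling comes from the rejection policy alone: the submitter is kept busy running the rejected task, so it cannot enqueue more work until that task finishes.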

TODOs:
 * Since we are adding futures as part of processing a future, I need to find a good, clean way to know when everything is done. (As of now I did some dirty stuff to see how it goes; the Iterator isn't thread safe. See something like {{waitForTPEIdle}} and {{checkFutures}} in my present patch.)
 * A good check on synchronisation and locks, since we are consuming the files and writing to the sequence file in parallel. Maybe having a sequence file per thread is an alternative: we merge all of them at the end and get rid of this synchronisation problem? Not sure, need to think.
 * Test whether it works in the real world (UTs do pass). See how much performance gain there is (my -useIterator test on S3 at least completes faster than with my previous useiterator mode). The main thing: test how much it helps to save memory. (Nothing done as of now; all in theory so far.)
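
For the first TODO, one possible way to detect completion when tasks submit further tasks is a pending-task counter paired with a latch. The sketch below is only an assumption about how that could look, not what {{waitForTPEIdle}} or {{checkFutures}} do in the draft patch. The class RecursiveListingTracker is hypothetical, and an in-memory map (directory names end with "/") stands in for a real file system.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration only; not the approach used in the draft patch.
class RecursiveListingTracker {
  private final Map<String, List<String>> tree;     // fake "file system": dir -> children
  private final ThreadPoolExecutor pool;
  private final AtomicInteger pending = new AtomicInteger();
  private final CountDownLatch done = new CountDownLatch(1);
  private final ConcurrentLinkedQueue<String> files = new ConcurrentLinkedQueue<>();

  RecursiveListingTracker(Map<String, List<String>> tree, int threads, int queueSize) {
    this.tree = tree;
    this.pool = new ThreadPoolExecutor(threads, threads, 0L, TimeUnit.MILLISECONDS,
        new ArrayBlockingQueue<>(queueSize), new ThreadPoolExecutor.CallerRunsPolicy());
  }

  private void submit(String dir) {
    pending.incrementAndGet();                      // count the task before it can run
    pool.execute(() -> {
      try {
        for (String child : tree.getOrDefault(dir, List.of())) {
          if (child.endsWith("/")) {
            submit(child);                          // directory: list it as a new task
          } else {
            files.add(child);                       // file: consume it immediately
          }
        }
      } finally {
        // The parent is still counted while it submits its children, so the
        // counter reaches zero only after the last task in the tree finishes.
        if (pending.decrementAndGet() == 0) {
          done.countDown();
        }
      }
    });
  }

  /** Lists everything under root and returns the files in sorted order. */
  List<String> listAll(String root) {
    submit(root);
    try {
      done.await();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    } finally {
      pool.shutdown();
    }
    List<String> result = new ArrayList<>(files);
    Collections.sort(result);
    return result;
  }

  public static void main(String[] args) {
    Map<String, List<String>> tree = Map.of(
        "/", List.of("/a/", "/f1"),
        "/a/", List.of("/a/f2", "/a/b/"),
        "/a/b/", List.of("/a/b/f3"));
    System.out.println(new RecursiveListingTracker(tree, 2, 1).listAll("/"));
  }
}
```

Because the counter is incremented before each submission and decremented only when a task ends, no polling of the executor or of a shared future list is needed. Whether this fits the real listing code is untested.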

Will try to sort out these things and come up with some more updates as I progress further.

cc. [~rajesh.balamohan]/ [~stevel@apache.org]/ [~weichiu]

> DistCp: Reduce memory usage using a fixed size ThreadPoolExecutor
> -----------------------------------------------------------------
>
>                 Key: HADOOP-17558
>                 URL: https://issues.apache.org/jira/browse/HADOOP-17558
>             Project: Hadoop Common
>          Issue Type: Improvement
>            Reporter: Ayush Saxena
>            Priority: Major
>         Attachments: HADOOP-17558-DRAFT-01.patch
>
>
> For S3 and other object stores, where listing is slow, use a fixed size TPE for building listing



--
This message was sent by Atlassian Jira
(v8.3.4#803005)