Posted to user@spark.apache.org by Kapil Garg <ka...@flipkart.com.INVALID> on 2021/05/05 14:45:05 UTC

How to read multiple HDFS directories

Hi,
I am facing an issue while reading multiple HDFS directories. The problem
statement and our current approach are described below.

*Problem Statement*
There are N HDFS directories each having K files. We want to read data from
all directories such that when we read data from directory D, we map all
the data and augment it with additional information specific to that
directory.

*Current Approach*
In the current approach, we iterate over the directories, read each one into
an RDD, map that RDD, and then add it to a list.
After all N directories have been read, we have a list of N RDDs.
We then call Spark's union on the list to merge them together.
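
For reference, the approach above can be sketched roughly as follows in
PySpark. This is only an illustration, not our actual job: the function name
read_tagged_directories, the dir_to_tag mapping, and the (tag, line) record
shape are hypothetical, and it assumes the files are plain text read with
textFile.

```python
def read_tagged_directories(sc, dir_to_tag):
    """Read each directory into an RDD, tag its records, and union the results.

    sc         -- a SparkContext (assumed to be created elsewhere)
    dir_to_tag -- hypothetical dict mapping an HDFS directory path to the
                  additional information specific to that directory
    """
    rdds = []
    for path, tag in dir_to_tag.items():
        # Bind `tag` as a default argument so each lambda keeps its own value
        # instead of all lambdas seeing the last value of the loop variable.
        tagged = sc.textFile(path).map(lambda line, tag=tag: (tag, line))
        rdds.append(tagged)
    # SparkContext.union merges the list of N RDDs into a single RDD.
    return sc.union(rdds)
```

On a cluster this would be called with something like
read_tagged_directories(sc, {"hdfs:///data/dirA": "A", "hdfs:///data/dirB": "B"}),
where the paths and tags are placeholders.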

This approach causes data skew because one directory is 12 GB whereas the
other RDDs are each less than 1 GB. When the large RDD's turn comes, Spark
submits its tasks on whichever executors are available, so the RDD ends up
on a few executors instead of being spread across all of them.

Is there a way to avoid this data skew? I couldn't find any RDD API or
Spark config that would distribute the data-reading tasks evenly across all
executors.

--
Regards
Kapil Garg
