Posted to user@spark.apache.org by Sachit Murarka <co...@gmail.com> on 2022/10/03 16:22:55 UTC

Reading too many files

Hello,

I am reading a large number of Parquet files in Spark 3.2. It does not give
any error in the logs, but after spark.read.parquet it is not able to proceed
further.
Can anyone please suggest whether there is any property to improve the
parallel reads? I am reading more than 25,000 files.

Kind Regards,
Sachit Murarka

Re: Reading too many files

Posted by Enrico Minack <in...@enrico.minack.dev>.
Hi,

Spark is fine with that many Parquet files in general:

// generate 100,000 small Parquet files (100,000 partitions, one file each)
spark.range(0, 1000000, 1, 100000).write.parquet("too-many-files.parquet")

// read the 100,000 Parquet files back
val df = spark.read.parquet("too-many-files.parquet")
df.show()
df.count()

Reading the files takes a few seconds, so there is no problem with the 
number of files.
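
If the file-listing step itself is what is slow for you, there are properties
that control how Spark parallelizes directory listing. A minimal sketch, with
the default values shown (set them before the read; the values here are
illustrative, not recommendations):

// Above this many paths, Spark lists files with a distributed job
// instead of sequentially on the driver (default 32).
spark.conf.set("spark.sql.sources.parallelPartitionDiscovery.threshold", 32)
// Maximum parallelism of that listing job (default 10000).
spark.conf.set("spark.sql.sources.parallelPartitionDiscovery.parallelism", 10000)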

What exactly do you mean by "after spark.read.parquet it is not able to
proceed further"?

Does that mean that executing the line
   val df = spark.read.parquet("too-many-files.parquet")
takes forever?

How long do individual tasks take? How many tasks are there for this line?
Where are the Parquet files stored? Where does the Spark job run?
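
A minimal sketch for answering those questions yourself, assuming the same
path as above (spark.time prints the wall-clock time of the enclosed block):

// time the listing/schema phase of the read
val df = spark.time { spark.read.parquet("too-many-files.parquet") }
// number of tasks a full scan would launch
println(df.rdd.getNumPartitions)
// time an actual scan over all files
spark.time { df.count() }

The Spark UI shows the same information per stage, including where tasks run
and how long each one takes.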

Enrico



On 03.10.22 at 18:22, Sachit Murarka wrote:
> Hello,
>
> I am reading a large number of Parquet files in Spark 3.2. It does not
> give any error in the logs, but after spark.read.parquet it is not able
> to proceed further.
> Can anyone please suggest whether there is any property to improve the
> parallel reads? I am reading more than 25,000 files.
>
> Kind Regards,
> Sachit Murarka


Re: Reading too many files

Posted by Henrik Pang <he...@simplemail.co.in>.
You may need a lot of cluster memory and fast disk I/O.
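
For example, a minimal sketch of sizing the session up front (the resource
values are illustrative assumptions, not tuned recommendations; executor
settings must be in place before the application starts):

import org.apache.spark.sql.SparkSession

// request more executor memory and cores when building the session
val spark = SparkSession.builder()
  .appName("read-many-parquet-files")
  .config("spark.executor.memory", "8g")  // illustrative value
  .config("spark.executor.cores", "4")    // illustrative value
  .getOrCreate()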


Sachit Murarka wrote:
> Can anyone please suggest whether there is any property to improve the
> parallel reads? I am reading more than 25,000 files.

-- 
Simple Mail
https://simplemail.co.in/

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: Reading too many files

Posted by Sid <fl...@gmail.com>.
Are you trying to run on the cloud?

On Mon, 3 Oct 2022, 21:55 Sachit Murarka, <co...@gmail.com> wrote:

> Hello,
>
> I am reading a large number of Parquet files in Spark 3.2. It does not give
> any error in the logs, but after spark.read.parquet it is not able to
> proceed further.
> Can anyone please suggest whether there is any property to improve the
> parallel reads? I am reading more than 25,000 files.
>
> Kind Regards,
> Sachit Murarka
>