Posted to user@spark.apache.org by Sachit Murarka <co...@gmail.com> on 2022/10/03 16:22:55 UTC
Reading too many files
Hello,
I am reading a large number of files in Spark 3.2 (Parquet). It is not giving
any error in the logs, but after spark.read.parquet it is not able to proceed
further.
Can anyone please suggest a property to improve the parallel
reads? I am reading more than 25,000 files.
Kind Regards,
Sachit Murarka
Re: Reading too many files
Posted by Enrico Minack <in...@enrico.minack.dev>.
Hi,
Spark is fine with that many Parquet files in general:
// generate 100,000 small Parquet files
spark.range(0, 1000000, 1, 100000).write.parquet("too-many-files.parquet")

// read the 100,000 Parquet files back
val df = spark.read.parquet("too-many-files.parquet")
df.show()
df.count()
Reading the files takes a few seconds, so there is no problem with the
number of files.
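If the slow part is listing the files rather than reading them (common on object stores), Spark's SQL conf has settings that control parallel partition discovery. A sketch for reference — the values shown are the documented defaults, not tuned recommendations for this job:

```scala
// Illustrative defaults. If more paths than `threshold` must be listed,
// Spark distributes the file listing across the cluster instead of doing
// it serially on the driver; `parallelism` caps the tasks used for it.
spark.conf.set("spark.sql.sources.parallelPartitionDiscovery.threshold", "32")
spark.conf.set("spark.sql.sources.parallelPartitionDiscovery.parallelism", "10000")
```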
What exactly do you mean by "But after spark.read.parquet , it is not
able to proceed further."?
Does that mean that executing the line
val df = spark.read.parquet("too-many-files.parquet")
takes forever?
How long do individual tasks take? How many tasks are there for this line?
Where are the Parquet files stored? Where does the Spark job run?
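As background to the task-count question: the number of scan tasks is governed by two settings that pack small files together. Purely illustrative (these are the stock defaults, not a suggestion from this thread):

```scala
// Illustrative defaults. Each scan task reads up to maxPartitionBytes of
// data; openCostInBytes is the estimated cost of opening one file, so many
// tiny files get packed into fewer, larger tasks.
spark.conf.set("spark.sql.files.maxPartitionBytes", "134217728") // 128 MB
spark.conf.set("spark.sql.files.openCostInBytes", "4194304")     // 4 MB
```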
Enrico
Am 03.10.22 um 18:22 schrieb Sachit Murarka:
> Hello,
>
> I am reading a large number of files in Spark 3.2 (Parquet). It is not
> giving any error in the logs, but after spark.read.parquet it is not
> able to proceed further.
> Can anyone please suggest a property to improve the
> parallel reads? I am reading more than 25,000 files.
>
> Kind Regards,
> Sachit Murarka
Re: Reading too many files
Posted by Henrik Pang <he...@simplemail.co.in>.
You may need a cluster with plenty of memory and fast disk I/O.
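If memory is the constraint, it has to be set at launch time rather than in the session. A purely illustrative spark-submit invocation — the sizes and the jar name are placeholders, not recommendations:

```shell
# Illustrative only: give the driver headroom for the file index of
# ~25,000 files and the executors room for the scan itself.
# "your-job.jar" is a placeholder for the actual application jar.
spark-submit \
  --driver-memory 4g \
  --executor-memory 8g \
  your-job.jar
```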
Sachit Murarka wrote:
> Can anyone please suggest a property to improve the
> parallel reads? I am reading more than 25,000 files.
--
Simple Mail
https://simplemail.co.in/
---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org
Re: Reading too many files
Posted by Sid <fl...@gmail.com>.
Are you trying to run this on the cloud?
On Mon, 3 Oct 2022, 21:55 Sachit Murarka, <co...@gmail.com> wrote:
> Hello,
>
> I am reading a large number of files in Spark 3.2 (Parquet). It is not
> giving any error in the logs, but after spark.read.parquet it is not able
> to proceed further.
> Can anyone please suggest a property to improve the parallel
> reads? I am reading more than 25,000 files.
>
> Kind Regards,
> Sachit Murarka
>