Posted to user@spark.apache.org by Joel D <ga...@gmail.com> on 2018/10/10 21:56:04 UTC

Process Million Binary Files

Hi,

I need to process millions of PDFs in HDFS using Spark. First I'm trying
with some 40k files. I'm using the binaryFiles API, with which I'm facing a
couple of issues:

1. It creates only 4 tasks and I can't seem to increase the parallelism
there.
2. It took 2276 seconds, which means that for millions of files it will take
ages to complete. I also expect it to fail for millions of records with
some timeout or GC-overhead exception.

val files = sparkSession.sparkContext.binaryFiles(filePath, 200).cache()

val fileContentRdd = files.map(file => myFunc(file))



Do you have any guidance on how I can process millions of files using the
binaryFiles API?

How can I increase the number of tasks/parallelism during the creation of
the files RDD?

Thanks

Re: Process Million Binary Files

Posted by Jörn Franke <jo...@gmail.com>.
I believe your use case would be better covered by a custom data source that reads PDF files.

On big data platforms in general you have the issue that individual PDF files are very small and there are a lot of them - this is not very efficient for those platforms. That could also be one source of your performance problems (not necessarily the parallelism). You would need to make 1 million requests to the NameNode (which could also be interpreted as a denial-of-service attack). Historically, Hadoop Archives were introduced to address this problem: https://hadoop.apache.org/docs/r1.2.1/hadoop_archives.html
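
For illustration, one could pack the PDFs into an archive once and then point Spark at the har:// path. This is only an untested sketch with made-up paths; the archive is still read file by file, but the NameNode only has to track a handful of archive files instead of millions of tiny ones:

// One-time packing step (HDFS shell, not Spark):
//   hadoop archive -archiveName pdfs.har -p /data pdfs /data/archives
// Then read the archived files through the har:// filesystem:
val archived = sparkSession.sparkContext
  .binaryFiles("har:///data/archives/pdfs.har/pdfs", 200)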

You can also try storing them first in HBase or, in the future, on Hadoop Ozone. That could make higher parallelism possible "out of the box".
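
A rough sketch of how reading them back from HBase could look (hypothetical table and column names; the parallelism then comes from the number of HBase regions rather than from file splits):

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes

// Hypothetical layout: one row per PDF, raw bytes in column family "f", qualifier "pdf".
val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, "pdfs")

val pdfBytes = sparkSession.sparkContext
  .newAPIHadoopRDD(conf, classOf[TableInputFormat],
    classOf[ImmutableBytesWritable], classOf[Result])
  .map { case (key, result) =>
    (Bytes.toString(key.get()),
     result.getValue(Bytes.toBytes("f"), Bytes.toBytes("pdf")))
  }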


Re: Process Million Binary Files

Posted by Nicolas PARIS <ni...@riseup.net>.
Hi Joel

I built such a pipeline to transform PDF -> text:
https://github.com/EDS-APHP/SparkPdfExtractor
You can take a look

It transforms 20M PDFs in 2 hours on a 5-node Spark cluster.
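
For reference, the per-file work in such a pipeline is roughly the following (a minimal sketch assuming Apache PDFBox 2.x on the classpath, not the actual code from that repo). The repartition() also spreads the CPU-heavy extraction over more tasks than the handful of partitions binaryFiles may give you:

import org.apache.pdfbox.pdmodel.PDDocument
import org.apache.pdfbox.text.PDFTextStripper

val files = sparkSession.sparkContext.binaryFiles("/data/pdfs", 200)

// (path, PortableDataStream) -> (path, extracted text)
val texts = files.repartition(400).map { case (path, stream) =>
  val doc = PDDocument.load(stream.toArray())
  try {
    (path, new PDFTextStripper().getText(doc))
  } finally {
    doc.close()
  }
}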

