Posted to user@spark.apache.org by Sun Rui <su...@163.com> on 2016/07/01 02:16:21 UTC

Re: One map per folder in spark or Hadoop

Say you have got all of your folder paths into a val folders: Seq[String]
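If you don't have that list yet, one way to build it, assuming all of the
folders sit under a single parent directory visible to the driver (the path
below is just an example):

import java.io.File

// list the immediate sub-directories of an example parent directory
val folders: Seq[String] = new File("/data/input")
  .listFiles()
  .filter(_.isDirectory)
  .map(_.getAbsolutePath)
  .toSeq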

import scala.sys.process._

// one partition per folder, so each folder is handled by its own map task
val statuses = sc.parallelize(folders, folders.size).mapPartitions { iter =>
  val folder = iter.next()
  // "/path/to/your/executable" is a placeholder for your black-box binary
  val status: Int = Seq("/path/to/your/executable", folder).!
  Iterator(status)
}
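
If you also want the exit codes back on the driver, something like this should
work (just a sketch, and cheap here since there is only one status per folder):

// gather one exit status per folder back on the driver
val exitCodes: Array[Int] = statuses.collect()

// parallelize keeps the original order, so each status lines up with its folder
val failed = folders.zip(exitCodes).collect { case (f, code) if code != 0 => f }
if (failed.nonEmpty) println(s"Failed folders: ${failed.mkString(", ")}")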

> On Jun 30, 2016, at 16:42, Balachandar R.A. <ba...@gmail.com> wrote:
> 
> Hello,
> 
> I have some 100 folders. Each folder contains 5 files. I have an executable that processes one folder. The executable is a black box and hence it cannot be modified. I would like to process the 100 folders in parallel using Apache Spark, so that I can spawn one map task per folder. Can anyone give me an idea? I have come across similar questions, but for Hadoop, where the answer was to use CombineFileInputFormat and a PathFilter. However, as I said, I want to use Apache Spark. Any idea?
> 
> Regards 
> Bala
> 


Re: One map per folder in spark or Hadoop

Posted by Deepak Sharma <de...@gmail.com>.
You have to distribute the files in a distributed file system such as HDFS.
Otherwise, copy the files to every executor's local file system and make sure
to mention the file scheme in the URI explicitly.
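
For example (just a sketch, the paths below are placeholders), the scheme makes
the difference explicit:

// HDFS: the data is reachable from every executor automatically
val fromHdfs = sc.textFile("hdfs:///data/input/records.txt")

// local file system: the file must already exist at this exact path on every
// worker node, hence the explicit file: scheme
val fromLocal = sc.textFile("file:///opt/data/input/records.txt")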

Thanks
Deepak

On Thu, Jul 7, 2016 at 7:13 PM, Balachandar R.A. <ba...@gmail.com>
wrote:

> Hi
>
> Thanks for the code snippet. Is it possible for the executable inside the
> map process to access directories and files present in the local file
> system? I know the tasks run on a slave node in a temporary working
> directory, and I can think of the distributed cache, but I would still like
> to know whether the map process can access the local file system.
>
> Regards
> Bala


-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net

Re: One map per folder in spark or Hadoop

Posted by "Balachandar R.A." <ba...@gmail.com>.
Hi

Thanks for the code snippet. Is it possible for the executable inside the map
process to access directories and files present in the local file system? I
know the tasks run on a slave node in a temporary working directory, and I can
think of the distributed cache, but I would still like to know whether the map
process can access the local file system.
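
To make the question concrete, this is roughly the kind of access I have in
mind (a made-up sketch, the path is only an example):

val canSee = sc.parallelize(folders, folders.size).mapPartitions { iter =>
  val folder = iter.next()
  // does this see a directory on the worker node's local disk from inside
  // the task?
  val refDir = new java.io.File("/opt/reference-data")   // example path only
  Iterator((folder, refDir.exists()))
}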

Regards
Bala

Re: One map per folder in spark or Hadoop

Posted by "Balachandar R.A." <ba...@gmail.com>.
Thank you very much. I will try this code and update you.

Regards
Bala