Posted to user@spark.apache.org by Peter Wolf <op...@gmail.com> on 2015/07/29 03:20:38 UTC

Spark and Speech Recognition

Hello, I am writing a Spark application to use speech recognition to
transcribe a very large number of recordings.

I need some help configuring Spark.

My app is basically a transformation with no side effects: recording URL
--> transcript.  The input is a huge file with one URL per line, and the
output is a huge file of transcripts.

The speech recognizer is written in Java (Sphinx4), so it can be packaged
as a JAR.

The recognizer is very processor intensive, so you can't run too many on
one machine-- perhaps one recognizer per core.  The recognizer is also
big-- maybe 1 GB.  But most of the recognizer is an immutable set of acoustic
and language models that can be shared with other instances of the recognizer.

So I want to run about one recognizer per core of each machine in my
cluster.  I want all recognizers on one machine to run within the same JVM
and share the same models.

How does one configure Spark for this sort of application?  How does one
control how Spark deploys the stages of the process?  Can someone point me
to an appropriate doc, or keywords I should Google?

Thanks
Peter

Re: Spark and Speech Recognition

Posted by Peter Wolf <op...@gmail.com>.
Oh...  That was embarrassingly easy!

Thank you, that was exactly the understanding of partitions that I needed.

P

On Thu, Jul 30, 2015 at 6:35 AM, Simon Elliston Ball <
simon@simonellistonball.com> wrote:

> You might also want to consider broadcasting the models to ensure you get
> one instance shared across cores on each machine; otherwise the model will
> be serialised to each task and you'll get a copy per executor (roughly one
> per core in this instance).
>
> Simon
>
> Sent from my iPhone
>
> On 30 Jul 2015, at 10:14, Akhil Das <ak...@sigmoidanalytics.com> wrote:
>
> Like this?
>
> sc.textFile("/sigmoid/audio/data/", 24).foreachPartition(urls =>
>   speechRecognizer(urls))
>
> Let 24 be the total number of cores that you have on all the workers.
>
> Thanks
> Best Regards
>
> On Wed, Jul 29, 2015 at 6:50 AM, Peter Wolf <op...@gmail.com> wrote:
>
>> Hello, I am writing a Spark application to use speech recognition to
>> transcribe a very large number of recordings.
>>
>> I need some help configuring Spark.
>>
>> My app is basically a transformation with no side effects: recording URL
>> --> transcript.  The input is a huge file with one URL per line, and the
>> output is a huge file of transcripts.
>>
>> The speech recognizer is written in Java (Sphinx4), so it can be packaged
>> as a JAR.
>>
>> The recognizer is very processor intensive, so you can't run too many on
>> one machine-- perhaps one recognizer per core.  The recognizer is also
>> big-- maybe 1 GB.  But most of the recognizer is an immutable set of acoustic
>> and language models that can be shared with other instances of the recognizer.
>>
>> So I want to run about one recognizer per core of each machine in my
>> cluster.  I want all recognizers on one machine to run within the same JVM
>> and share the same models.
>>
>> How does one configure Spark for this sort of application?  How does one
>> control how Spark deploys the stages of the process?  Can someone point me
>> to an appropriate doc, or keywords I should Google?
>>
>> Thanks
>> Peter
>>
>
>

Re: Spark and Speech Recognition

Posted by Simon Elliston Ball <si...@simonellistonball.com>.
You might also want to consider broadcasting the models to ensure you get one instance shared across cores on each machine; otherwise the model will be serialised to each task and you'll get a copy per executor (roughly one per core in this instance).

Simon 

Sent from my iPhone
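[Editor's note: the broadcast approach suggested above can be sketched roughly as follows. Models, Recognizer, and the file paths are hypothetical placeholders standing in for the Sphinx4 classes and the real data locations; only SparkContext, broadcast, textFile, mapPartitions, and saveAsTextFile are Spark-provided names.]

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical stand-ins for the Sphinx4 model and recognizer classes.
case class Models(acoustic: Array[Byte], language: Array[Byte])
class Recognizer(models: Models) {
  def transcribe(url: String): String = s"<transcript of $url>" // placeholder
}

object Transcribe {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("transcribe"))

    // Load the ~1 GB models once on the driver, then broadcast them so each
    // executor JVM keeps a single shared copy instead of one per task.
    val bcModels = sc.broadcast(Models(Array.emptyByteArray, Array.emptyByteArray))

    sc.textFile("/sigmoid/audio/data/", 24)
      .mapPartitions { urls =>
        // One recognizer per partition (roughly one per core), all reading
        // the same broadcast models within the executor JVM.
        val recognizer = new Recognizer(bcModels.value)
        urls.map(recognizer.transcribe)
      }
      .saveAsTextFile("/sigmoid/audio/transcripts/")

    sc.stop()
  }
}
```

Since this job is a pure transformation (URL --> transcript), mapPartitions plus saveAsTextFile also captures the output, which foreachPartition alone would discard.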

> On 30 Jul 2015, at 10:14, Akhil Das <ak...@sigmoidanalytics.com> wrote:
> 
> Like this?
> 
> sc.textFile("/sigmoid/audio/data/", 24).foreachPartition(urls => speechRecognizer(urls))
> 
> Let 24 be the total number of cores that you have on all the workers.
> 
> Thanks
> Best Regards
> 
>> On Wed, Jul 29, 2015 at 6:50 AM, Peter Wolf <op...@gmail.com> wrote:
>> Hello, I am writing a Spark application to use speech recognition to transcribe a very large number of recordings.
>> 
>> I need some help configuring Spark.
>> 
>> My app is basically a transformation with no side effects: recording URL --> transcript.  The input is a huge file with one URL per line, and the output is a huge file of transcripts.  
>> 
>> The speech recognizer is written in Java (Sphinx4), so it can be packaged as a JAR.
>> 
>> The recognizer is very processor intensive, so you can't run too many on one machine-- perhaps one recognizer per core.  The recognizer is also big-- maybe 1 GB.  But most of the recognizer is an immutable set of acoustic and language models that can be shared with other instances of the recognizer.
>> 
>> So I want to run about one recognizer per core of each machine in my cluster.  I want all recognizers on one machine to run within the same JVM and share the same models.
>> 
>> How does one configure Spark for this sort of application?  How does one control how Spark deploys the stages of the process?  Can someone point me to an appropriate doc, or keywords I should Google?
>> 
>> Thanks
>> Peter 
> 

Re: Spark and Speech Recognition

Posted by Akhil Das <ak...@sigmoidanalytics.com>.
Like this?

sc.textFile("/sigmoid/audio/data/", 24).foreachPartition(urls =>
  speechRecognizer(urls))

Let 24 be the total number of cores that you have on all the workers.
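[Editor's note: spelled out a little, assuming an existing SparkContext `sc`. speechRecognizer is a hypothetical user function from this thread, not part of Spark; its body here is a placeholder.]

```scala
// One recognizer instance should be created per partition and fed every
// URL in that partition's iterator.
def speechRecognizer(urls: Iterator[String]): Unit = {
  urls.foreach(url => println(s"transcribing $url")) // placeholder work
}

// Requesting 24 partitions (~= total cores across all workers) means each
// core works through one partition with a single recognizer instance.
val urls = sc.textFile("/sigmoid/audio/data/", 24)
urls.foreachPartition(speechRecognizer)

// Note: foreachPartition returns Unit, so it only performs side effects.
// To keep the transcripts, use mapPartitions and save the resulting RDD.
```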

Thanks
Best Regards

On Wed, Jul 29, 2015 at 6:50 AM, Peter Wolf <op...@gmail.com> wrote:

> Hello, I am writing a Spark application to use speech recognition to
> transcribe a very large number of recordings.
>
> I need some help configuring Spark.
>
> My app is basically a transformation with no side effects: recording URL
> --> transcript.  The input is a huge file with one URL per line, and the
> output is a huge file of transcripts.
>
> The speech recognizer is written in Java (Sphinx4), so it can be packaged
> as a JAR.
>
> The recognizer is very processor intensive, so you can't run too many on
> one machine-- perhaps one recognizer per core.  The recognizer is also
> big-- maybe 1 GB.  But most of the recognizer is an immutable set of acoustic
> and language models that can be shared with other instances of the recognizer.
>
> So I want to run about one recognizer per core of each machine in my
> cluster.  I want all recognizers on one machine to run within the same JVM
> and share the same models.
>
> How does one configure Spark for this sort of application?  How does one
> control how Spark deploys the stages of the process?  Can someone point me
> to an appropriate doc, or keywords I should Google?
>
> Thanks
> Peter
>