Posted to user@spark.apache.org by Shashidhar Rao <ra...@gmail.com> on 2015/07/14 07:18:06 UTC

Research ideas using spark

Hi,

I am doing my PhD thesis on large-scale machine learning, e.g. online
learning, batch, and mini-batch learning.

Could somebody help me with ideas, especially in the context of Spark and
the above learning methods?

Some ideas would be improvements to existing algorithms, implementing new
features (especially for the above learning methods), and algorithms that
have not been implemented yet.

If somebody could help me with some ideas it would really accelerate my
work.

A few pointers to research papers on Spark or Mahout would also be welcome.

Thanks in advance.

Regards

Re: Research ideas using spark

Posted by Akhil Das <ak...@sigmoidanalytics.com>.
Try repartitioning to a higher number of partitions (at least 3-4 times the
total number of CPU cores). What operation are you doing? If you are doing
a join/groupBy type of operation, the task that takes so long may be the
one holding most of the values; in that case you need a Partitioner that
distributes the keys evenly across machines to speed things up.
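
For example, a rough, untested sketch along those lines (the input path,
the comma-separated key, and the 4x multiplier are just placeholders, not
something from your job):

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

// Repartition to roughly 4 partitions per core and give groupByKey an
// explicit HashPartitioner so the keys are spread across the cluster.
object RepartitionSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("repartition-sketch"))

    val pairs = sc.textFile("hdfs:///data/input")        // placeholder path
      .map(line => (line.split(",")(0), line))           // placeholder key

    val numPartitions = sc.defaultParallelism * 4        // ~3-4x the cores

    val grouped = pairs
      .repartition(numPartitions)
      .groupByKey(new HashPartitioner(numPartitions))

    grouped.mapValues(_.size).saveAsTextFile("hdfs:///data/output")
    sc.stop()
  }
}

If the keys themselves are heavily skewed, a plain HashPartitioner may not
be enough; salting the keys or writing a custom Partitioner can help
further.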

Thanks
Best Regards

On Tue, Jul 14, 2015 at 11:12 AM, shahid ashraf <sh...@trialx.com> wrote:

> hi
>
> I have a 10 node cluster  i loaded the data onto hdfs, so the no. of
> partitions i get is 9. I am running a spark application , it gets stuck on
> one of tasks, looking at the UI it seems application is not using all nodes
> to do calculations. attached is the screen shot of tasks, it seems tasks
> are put on each node more then once. looking at tasks 8 tasks get completed
> under 7-8 minutes and one task takes around 30 minutes so causing the delay
> in results.
>
>
> On Tue, Jul 14, 2015 at 10:48 AM, Shashidhar Rao <
> raoshashidhar123@gmail.com> wrote:
>
>> Hi,
>>
>> I am doing my PHD thesis on large scale machine learning e.g  Online
>> learning, batch and mini batch learning.
>>
>> Could somebody help me with ideas especially in the context of Spark and
>> to the above learning methods.
>>
>> Some ideas like improvement to existing algorithms, implementing new
>> features especially the above learning methods and algorithms that have not
>> been implemented etc.
>>
>> If somebody could help me with some ideas it would really accelerate my
>> work.
>>
>> Plus few ideas on research papers regarding Spark or Mahout.
>>
>> Thanks in advance.
>>
>> Regards
>>
>
>
>
> --
> with Regards
> Shahid Ashraf
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
> For additional commands, e-mail: user-help@spark.apache.org
>

Re: Research ideas using spark

Posted by Ravindra <ra...@gmail.com>.
Look at this:
http://www.forbes.com/sites/lisabrownlee/2015/07/10/the-11-trillion-internet-of-things-big-data-and-pattern-of-life-pol-analytics/

On Wed, Jul 15, 2015 at 10:19 PM shahid ashraf <sh...@trialx.com> wrote:

> Sorry Guys!
>
> I mistakenly added my question to this thread( Research ideas using
> spark). Moreover people can ask any question , this spark user group is for
> that.
>
> Cheers!
> 😊
>
> On Wed, Jul 15, 2015 at 9:43 PM, Robin East <ro...@xense.co.uk>
> wrote:
>
>> Well said Will. I would add that you might want to investigate GraphChi
>> which claims to be able to run a number of large-scale graph processing
>> tasks on a workstation much quicker than a very large Hadoop cluster. It
>> would be interesting to know how widely applicable the approach GraphChi
>> takes and what implications it has for parallel/distributed computing
>> approaches. A rich seam to mine indeed.
>>
>> Robin
>>
>> On 15 Jul 2015, at 14:48, William Temperley <wi...@gmail.com>
>> wrote:
>>
>> There seems to be a bit of confusion here - the OP (doing the PhD) had
>> the thread hijacked by someone with a similar name asking a mundane
>> question.
>>
>> It would be a shame to send someone away so rudely, who may do valuable
>> work on Spark.
>>
>> Sashidar (not Sashid!) I'm personally interested in running graph
>> algorithms for image segmentation using MLib and Spark.  I've got many
>> questions though - like is it even going to give me a speed-up?  (
>> http://www.frankmcsherry.org/graph/scalability/cost/2015/01/15/COST.html)
>>
>> It's not obvious to me which classes of graph algorithms can be
>> implemented correctly and efficiently in a highly parallel manner.  There's
>> tons of work to be done here, I'm sure. Also, look at parallel geospatial
>> algorithms - there's a lot of work being done on this.
>>
>> Best, Will
>>
>>
>>
>> On 15 July 2015 at 09:01, Vineel Yalamarthy <vi...@gmail.com>
>> wrote:
>>
>>> Hi Daniel
>>>
>>> Well said
>>>
>>> Regards
>>> Vineel
>>>
>>> On Tue, Jul 14, 2015, 6:11 AM Daniel Darabos <
>>> daniel.darabos@lynxanalytics.com> wrote:
>>>
>>>> Hi Shahid,
>>>> To be honest I think this question is better suited for Stack Overflow
>>>> than for a PhD thesis.
>>>>
>>>> On Tue, Jul 14, 2015 at 7:42 AM, shahid ashraf <sh...@trialx.com>
>>>> wrote:
>>>>
>>>>> hi
>>>>>
>>>>> I have a 10 node cluster  i loaded the data onto hdfs, so the no. of
>>>>> partitions i get is 9. I am running a spark application , it gets stuck on
>>>>> one of tasks, looking at the UI it seems application is not using all nodes
>>>>> to do calculations. attached is the screen shot of tasks, it seems tasks
>>>>> are put on each node more then once. looking at tasks 8 tasks get completed
>>>>> under 7-8 minutes and one task takes around 30 minutes so causing the delay
>>>>> in results.
>>>>>
>>>>>
>>>>> On Tue, Jul 14, 2015 at 10:48 AM, Shashidhar Rao <
>>>>> raoshashidhar123@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I am doing my PHD thesis on large scale machine learning e.g  Online
>>>>>> learning, batch and mini batch learning.
>>>>>>
>>>>>> Could somebody help me with ideas especially in the context of Spark
>>>>>> and to the above learning methods.
>>>>>>
>>>>>> Some ideas like improvement to existing algorithms, implementing new
>>>>>> features especially the above learning methods and algorithms that have not
>>>>>> been implemented etc.
>>>>>>
>>>>>> If somebody could help me with some ideas it would really accelerate
>>>>>> my work.
>>>>>>
>>>>>> Plus few ideas on research papers regarding Spark or Mahout.
>>>>>>
>>>>>> Thanks in advance.
>>>>>>
>>>>>> Regards
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> with Regards
>>>>> Shahid Ashraf
>>>>>
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>>>>> For additional commands, e-mail: user-help@spark.apache.org
>>>>>
>>>>
>>>>
>>
>>
>
>
> --
> with Regards
> Shahid Ashraf
>

Re: Research ideas using spark

Posted by shahid ashraf <sh...@trialx.com>.
Sorry guys!

I mistakenly added my question to this thread (Research ideas using Spark).
Moreover, people can ask any question; this Spark user group is for that.

Cheers!
😊

On Wed, Jul 15, 2015 at 9:43 PM, Robin East <ro...@xense.co.uk> wrote:

> Well said Will. I would add that you might want to investigate GraphChi
> which claims to be able to run a number of large-scale graph processing
> tasks on a workstation much quicker than a very large Hadoop cluster. It
> would be interesting to know how widely applicable the approach GraphChi
> takes and what implications it has for parallel/distributed computing
> approaches. A rich seam to mine indeed.
>
> Robin
>
> On 15 Jul 2015, at 14:48, William Temperley <wi...@gmail.com>
> wrote:
>
> There seems to be a bit of confusion here - the OP (doing the PhD) had the
> thread hijacked by someone with a similar name asking a mundane question.
>
> It would be a shame to send someone away so rudely, who may do valuable
> work on Spark.
>
> Sashidar (not Sashid!) I'm personally interested in running graph
> algorithms for image segmentation using MLib and Spark.  I've got many
> questions though - like is it even going to give me a speed-up?  (
> http://www.frankmcsherry.org/graph/scalability/cost/2015/01/15/COST.html)
>
> It's not obvious to me which classes of graph algorithms can be
> implemented correctly and efficiently in a highly parallel manner.  There's
> tons of work to be done here, I'm sure. Also, look at parallel geospatial
> algorithms - there's a lot of work being done on this.
>
> Best, Will
>
>
>
> On 15 July 2015 at 09:01, Vineel Yalamarthy <vi...@gmail.com>
> wrote:
>
>> Hi Daniel
>>
>> Well said
>>
>> Regards
>> Vineel
>>
>> On Tue, Jul 14, 2015, 6:11 AM Daniel Darabos <
>> daniel.darabos@lynxanalytics.com> wrote:
>>
>>> Hi Shahid,
>>> To be honest I think this question is better suited for Stack Overflow
>>> than for a PhD thesis.
>>>
>>> On Tue, Jul 14, 2015 at 7:42 AM, shahid ashraf <sh...@trialx.com>
>>> wrote:
>>>
>>>> hi
>>>>
>>>> I have a 10 node cluster  i loaded the data onto hdfs, so the no. of
>>>> partitions i get is 9. I am running a spark application , it gets stuck on
>>>> one of tasks, looking at the UI it seems application is not using all nodes
>>>> to do calculations. attached is the screen shot of tasks, it seems tasks
>>>> are put on each node more then once. looking at tasks 8 tasks get completed
>>>> under 7-8 minutes and one task takes around 30 minutes so causing the delay
>>>> in results.
>>>>
>>>>
>>>> On Tue, Jul 14, 2015 at 10:48 AM, Shashidhar Rao <
>>>> raoshashidhar123@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I am doing my PHD thesis on large scale machine learning e.g  Online
>>>>> learning, batch and mini batch learning.
>>>>>
>>>>> Could somebody help me with ideas especially in the context of Spark
>>>>> and to the above learning methods.
>>>>>
>>>>> Some ideas like improvement to existing algorithms, implementing new
>>>>> features especially the above learning methods and algorithms that have not
>>>>> been implemented etc.
>>>>>
>>>>> If somebody could help me with some ideas it would really accelerate
>>>>> my work.
>>>>>
>>>>> Plus few ideas on research papers regarding Spark or Mahout.
>>>>>
>>>>> Thanks in advance.
>>>>>
>>>>> Regards
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> with Regards
>>>> Shahid Ashraf
>>>>
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>>>> For additional commands, e-mail: user-help@spark.apache.org
>>>>
>>>
>>>
>
>


-- 
with Regards
Shahid Ashraf

Re: Research ideas using spark

Posted by Robin East <ro...@xense.co.uk>.
Well said, Will. I would add that you might want to investigate GraphChi, which claims to run a number of large-scale graph-processing tasks on a single workstation much faster than a very large Hadoop cluster. It would be interesting to know how widely applicable GraphChi's approach is and what implications it has for parallel/distributed computing approaches. A rich seam to mine indeed.

Robin
> On 15 Jul 2015, at 14:48, William Temperley <wi...@gmail.com> wrote:
> 
> There seems to be a bit of confusion here - the OP (doing the PhD) had the thread hijacked by someone with a similar name asking a mundane question.
> 
> It would be a shame to send someone away so rudely, who may do valuable work on Spark.
> 
> Sashidar (not Sashid!) I'm personally interested in running graph algorithms for image segmentation using MLib and Spark.  I've got many questions though - like is it even going to give me a speed-up?  (http://www.frankmcsherry.org/graph/scalability/cost/2015/01/15/COST.html)
> 
> It's not obvious to me which classes of graph algorithms can be implemented correctly and efficiently in a highly parallel manner.  There's tons of work to be done here, I'm sure. Also, look at parallel geospatial algorithms - there's a lot of work being done on this.
> 
> Best, Will
> 
> 
> 
> On 15 July 2015 at 09:01, Vineel Yalamarthy <vineelyalamarthy@gmail.com <ma...@gmail.com>> wrote:
> Hi Daniel
> 
> Well said
> 
> Regards 
> Vineel
> 
> 
> On Tue, Jul 14, 2015, 6:11 AM Daniel Darabos <daniel.darabos@lynxanalytics.com <ma...@lynxanalytics.com>> wrote:
> Hi Shahid,
> To be honest I think this question is better suited for Stack Overflow than for a PhD thesis.
> 
> On Tue, Jul 14, 2015 at 7:42 AM, shahid ashraf <shahid@trialx.com <ma...@trialx.com>> wrote:
> hi 
> 
> I have a 10 node cluster  i loaded the data onto hdfs, so the no. of partitions i get is 9. I am running a spark application , it gets stuck on one of tasks, looking at the UI it seems application is not using all nodes to do calculations. attached is the screen shot of tasks, it seems tasks are put on each node more then once. looking at tasks 8 tasks get completed under 7-8 minutes and one task takes around 30 minutes so causing the delay in results. 
> 
> 
> On Tue, Jul 14, 2015 at 10:48 AM, Shashidhar Rao <raoshashidhar123@gmail.com <ma...@gmail.com>> wrote:
> Hi,
> 
> I am doing my PHD thesis on large scale machine learning e.g  Online learning, batch and mini batch learning.
> 
> Could somebody help me with ideas especially in the context of Spark and to the above learning methods. 
> 
> Some ideas like improvement to existing algorithms, implementing new features especially the above learning methods and algorithms that have not been implemented etc.
> 
> If somebody could help me with some ideas it would really accelerate my work.
> 
> Plus few ideas on research papers regarding Spark or Mahout.
> 
> Thanks in advance.
> 
> Regards 
> 
> 
> 
> -- 
> with Regards
> Shahid Ashraf
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org <ma...@spark.apache.org>
> For additional commands, e-mail: user-help@spark.apache.org <ma...@spark.apache.org>
> 
> 


Re: Research ideas using spark

Posted by William Temperley <wi...@gmail.com>.
There seems to be a bit of confusion here - the OP (doing the PhD) had the
thread hijacked by someone with a similar name asking a mundane question.

It would be a shame to send someone away so rudely who may do valuable
work on Spark.

Shashidhar (not Shahid!), I'm personally interested in running graph
algorithms for image segmentation using MLlib and Spark.  I've got many
questions though - like, is it even going to give me a speed-up?  (
http://www.frankmcsherry.org/graph/scalability/cost/2015/01/15/COST.html)

It's not obvious to me which classes of graph algorithms can be
implemented correctly and efficiently in a highly parallel manner.  There's
tons of work to be done here, I'm sure. Also, look at parallel geospatial
algorithms - there's a lot of work being done on this.
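
To make the graph idea concrete, here is a tiny, untested sketch - the
pixel intensities, the adjacency, and the 0.05 similarity threshold are
all made up - that treats pixels as vertices and uses GraphX connected
components as crude segment labels:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

// Four "pixels" with intensities, edges kept only between neighbours whose
// intensities differ by less than a made-up threshold, and connected
// components as the segment labels.
object SegmentationSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("cc-segmentation").setMaster("local[*]"))

    val pixels = sc.parallelize(Seq(
      (0L, 0.10), (1L, 0.12),   // dark region
      (2L, 0.90), (3L, 0.88)))  // bright region

    val edges = sc.parallelize(Seq(
      Edge(0L, 1L, math.abs(0.10 - 0.12)),
      Edge(1L, 2L, math.abs(0.12 - 0.90)),
      Edge(2L, 3L, math.abs(0.90 - 0.88))))
      .filter(_.attr < 0.05)    // keep only "similar" neighbours

    val segments = Graph(pixels, edges).connectedComponents().vertices
    segments.collect().foreach { case (pixel, label) =>
      println(s"pixel $pixel -> segment $label")
    }
    sc.stop()
  }
}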

Best, Will



On 15 July 2015 at 09:01, Vineel Yalamarthy <vi...@gmail.com>
wrote:

> Hi Daniel
>
> Well said
>
> Regards
> Vineel
>
> On Tue, Jul 14, 2015, 6:11 AM Daniel Darabos <
> daniel.darabos@lynxanalytics.com> wrote:
>
>> Hi Shahid,
>> To be honest I think this question is better suited for Stack Overflow
>> than for a PhD thesis.
>>
>> On Tue, Jul 14, 2015 at 7:42 AM, shahid ashraf <sh...@trialx.com> wrote:
>>
>>> hi
>>>
>>> I have a 10 node cluster  i loaded the data onto hdfs, so the no. of
>>> partitions i get is 9. I am running a spark application , it gets stuck on
>>> one of tasks, looking at the UI it seems application is not using all nodes
>>> to do calculations. attached is the screen shot of tasks, it seems tasks
>>> are put on each node more then once. looking at tasks 8 tasks get completed
>>> under 7-8 minutes and one task takes around 30 minutes so causing the delay
>>> in results.
>>>
>>>
>>> On Tue, Jul 14, 2015 at 10:48 AM, Shashidhar Rao <
>>> raoshashidhar123@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I am doing my PHD thesis on large scale machine learning e.g  Online
>>>> learning, batch and mini batch learning.
>>>>
>>>> Could somebody help me with ideas especially in the context of Spark
>>>> and to the above learning methods.
>>>>
>>>> Some ideas like improvement to existing algorithms, implementing new
>>>> features especially the above learning methods and algorithms that have not
>>>> been implemented etc.
>>>>
>>>> If somebody could help me with some ideas it would really accelerate my
>>>> work.
>>>>
>>>> Plus few ideas on research papers regarding Spark or Mahout.
>>>>
>>>> Thanks in advance.
>>>>
>>>> Regards
>>>>
>>>
>>>
>>>
>>> --
>>> with Regards
>>> Shahid Ashraf
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>>> For additional commands, e-mail: user-help@spark.apache.org
>>>
>>
>>

Re: Research ideas using spark

Posted by Vineel Yalamarthy <vi...@gmail.com>.
Hi Daniel

Well said

Regards
Vineel

On Tue, Jul 14, 2015, 6:11 AM Daniel Darabos <
daniel.darabos@lynxanalytics.com> wrote:

> Hi Shahid,
> To be honest I think this question is better suited for Stack Overflow
> than for a PhD thesis.
>
> On Tue, Jul 14, 2015 at 7:42 AM, shahid ashraf <sh...@trialx.com> wrote:
>
>> hi
>>
>> I have a 10 node cluster  i loaded the data onto hdfs, so the no. of
>> partitions i get is 9. I am running a spark application , it gets stuck on
>> one of tasks, looking at the UI it seems application is not using all nodes
>> to do calculations. attached is the screen shot of tasks, it seems tasks
>> are put on each node more then once. looking at tasks 8 tasks get completed
>> under 7-8 minutes and one task takes around 30 minutes so causing the delay
>> in results.
>>
>>
>> On Tue, Jul 14, 2015 at 10:48 AM, Shashidhar Rao <
>> raoshashidhar123@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I am doing my PHD thesis on large scale machine learning e.g  Online
>>> learning, batch and mini batch learning.
>>>
>>> Could somebody help me with ideas especially in the context of Spark and
>>> to the above learning methods.
>>>
>>> Some ideas like improvement to existing algorithms, implementing new
>>> features especially the above learning methods and algorithms that have not
>>> been implemented etc.
>>>
>>> If somebody could help me with some ideas it would really accelerate my
>>> work.
>>>
>>> Plus few ideas on research papers regarding Spark or Mahout.
>>>
>>> Thanks in advance.
>>>
>>> Regards
>>>
>>
>>
>>
>> --
>> with Regards
>> Shahid Ashraf
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>> For additional commands, e-mail: user-help@spark.apache.org
>>
>
>

Re: Research ideas using spark

Posted by Daniel Darabos <da...@lynxanalytics.com>.
Hi Shahid,
To be honest, I think this question is better suited for Stack Overflow
than for a PhD thesis.

On Tue, Jul 14, 2015 at 7:42 AM, shahid ashraf <sh...@trialx.com> wrote:

> hi
>
> I have a 10 node cluster  i loaded the data onto hdfs, so the no. of
> partitions i get is 9. I am running a spark application , it gets stuck on
> one of tasks, looking at the UI it seems application is not using all nodes
> to do calculations. attached is the screen shot of tasks, it seems tasks
> are put on each node more then once. looking at tasks 8 tasks get completed
> under 7-8 minutes and one task takes around 30 minutes so causing the delay
> in results.
>
>
> On Tue, Jul 14, 2015 at 10:48 AM, Shashidhar Rao <
> raoshashidhar123@gmail.com> wrote:
>
>> Hi,
>>
>> I am doing my PHD thesis on large scale machine learning e.g  Online
>> learning, batch and mini batch learning.
>>
>> Could somebody help me with ideas especially in the context of Spark and
>> to the above learning methods.
>>
>> Some ideas like improvement to existing algorithms, implementing new
>> features especially the above learning methods and algorithms that have not
>> been implemented etc.
>>
>> If somebody could help me with some ideas it would really accelerate my
>> work.
>>
>> Plus few ideas on research papers regarding Spark or Mahout.
>>
>> Thanks in advance.
>>
>> Regards
>>
>
>
>
> --
> with Regards
> Shahid Ashraf
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
> For additional commands, e-mail: user-help@spark.apache.org
>

Re: Research ideas using spark

Posted by shahid ashraf <sh...@trialx.com>.
Hi,

I have a 10-node cluster. I loaded the data onto HDFS, so the number of
partitions I get is 9. I am running a Spark application and it gets stuck
on one of the tasks; looking at the UI, it seems the application is not
using all nodes for the calculations. Attached is a screenshot of the
tasks; it seems tasks are put on each node more than once. Looking at the
tasks, 8 tasks complete within 7-8 minutes and one task takes around 30
minutes, causing the delay in the results.
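
This is the kind of check I can run to see whether a single partition
holds most of the records (a rough, untested sketch; the HDFS path is just
a placeholder for my input):

import org.apache.spark.{SparkConf, SparkContext}

// Print the partition count and the record count per partition; one huge
// partition is the usual cause of a single long-running task.
object SkewCheck {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("skew-check"))
    val rdd = sc.textFile("hdfs:///data/input")

    println(s"partitions: ${rdd.partitions.length}")

    // Count records in each partition without shuffling the data.
    val sizes = rdd
      .mapPartitionsWithIndex((idx, it) => Iterator((idx, it.size)))
      .collect()

    sizes.sortBy(-_._2).foreach { case (idx, n) =>
      println(s"partition $idx: $n records")
    }
    sc.stop()
  }
}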


On Tue, Jul 14, 2015 at 10:48 AM, Shashidhar Rao <raoshashidhar123@gmail.com
> wrote:

> Hi,
>
> I am doing my PHD thesis on large scale machine learning e.g  Online
> learning, batch and mini batch learning.
>
> Could somebody help me with ideas especially in the context of Spark and
> to the above learning methods.
>
> Some ideas like improvement to existing algorithms, implementing new
> features especially the above learning methods and algorithms that have not
> been implemented etc.
>
> If somebody could help me with some ideas it would really accelerate my
> work.
>
> Plus few ideas on research papers regarding Spark or Mahout.
>
> Thanks in advance.
>
> Regards
>



-- 
with Regards
Shahid Ashraf

Re: Research ideas using spark

Posted by Michael Segel <ms...@hotmail.com>.
Ok…

After some off-line exchanges with Shashidhar Rao, we came up with an idea…

Apply machine learning to either implement or improve autoscaling (up or down) within a Storm/Akka cluster.

While I don't know what constitutes an acceptable PhD thesis, or a senior project for undergrads… this is a real-life problem that actually has some real value.

First, Storm doesn't scale down. Unless there have been some improvements in the last year, you really can't easily scale down the number of workers and transfer state to another worker.
Looking at Akka, that would be an easier task because of the actor model. However, I don't know Akka that well, so I can't say whether this is already implemented.

So besides the mechanism to scale (up and down), you then have the issue of machine learning in terms of load and how to scale properly.
This could be as simple as a PID function that watches the queues between spout/bolt and bolt/bolt, or something more advanced. This is where the research part of the project comes in. (What do you monitor, and how do you calculate and determine when to scale up or down, weighing in the cost(s) of the scaling action itself?)
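
To illustrate the simplest version of that idea, here is a rough, untested sketch of a PID-style controller that maps queue depth to a suggested worker count (the gains and the target queue depth are made-up numbers; tuning them is part of the research):

// Not tied to Storm or Akka; just the control function itself.
class PidScaler(kp: Double, ki: Double, kd: Double, targetQueueDepth: Double) {
  private var integral = 0.0
  private var lastError = 0.0

  def suggestWorkers(currentWorkers: Int, queueDepth: Double, dtSeconds: Double): Int = {
    val error = queueDepth - targetQueueDepth
    integral += error * dtSeconds
    val derivative = (error - lastError) / dtSeconds
    lastError = error

    val adjustment = kp * error + ki * integral + kd * derivative
    math.max(1, currentWorkers + math.round(adjustment).toInt)
  }
}

object PidScalerDemo extends App {
  val scaler = new PidScaler(kp = 0.01, ki = 0.001, kd = 0.005, targetQueueDepth = 1000.0)
  // Queue depth well above target, so the controller suggests more workers.
  println(scaler.suggestWorkers(currentWorkers = 4, queueDepth = 5000.0, dtSeconds = 10.0))
}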

Again, it's a worthwhile project, something that actually has business value, especially in terms of Lambda and other groovy Greek-lettered names for cluster designs (Zeta? ;-) )
where you have both M/R (computational) and subjective real-time (including micro-batch) processing occurring either on the same cluster or within the same DC infrastructure.


Again, I don't know whether this is worthy of a PhD thesis, a Master's thesis, or a senior project, but it is something that one could sink one's teeth into that could potentially lead to a commercial-grade project if done properly.

Good luck with it.

HTH 

-Mike




> On Jul 15, 2015, at 12:40 PM, vaquar khan <va...@gmail.com> wrote:
> 
> I would suggest study spark ,flink,strom and based on your understanding and finding prepare your research paper.
> 
> May be you will invented new spark ☺
> 
> Regards, 
> Vaquar khan
> 
> On 16 Jul 2015 00:47, "Michael Segel" <msegel_hadoop@hotmail.com <ma...@hotmail.com>> wrote:
> Silly question…
> 
> When thinking about a PhD thesis… do you want to tie it to a specific technology or do you want to investigate an idea but then use a specific technology. 
> Or is this an outdated way of thinking? 
> 
> "I am doing my PHD thesis on large scale machine learning e.g  Online learning, batch and mini batch learning."
> 
> So before we look at technologies like Spark… could the OP break down a more specific concept or idea that he wants to pursue? 
> 
> Looking at what Jorn said… 
> 
> Using machine learning to better predict workloads in terms of managing clusters… This could be interesting… but is it enough for a PhD thesis, or of interest to the OP? 
> 
> 
>> On Jul 15, 2015, at 9:43 AM, Jörn Franke <jornfranke@gmail.com <ma...@gmail.com>> wrote:
>> 
>> Well one of the strength of spark is standardized general distributed processing allowing many different types of processing, such as graph processing, stream processing etc. The limitation is that it is less performant than one system focusing only on one type of processing (eg graph processing). I miss - and this may not be spark specific - some artificial intelligence to manage a cluster, e.g. Predicting workloads, how long a job may run based on previously executed similar jobs etc. Furthermore, many optimizations you have do to manually, e.g. Bloom filters, partitioning etc - if you find here as well some intelligence that does this automatically based on previously executed jobs taking into account that optimizations themselves change over time would be great... You may also explore feature interaction
>> 
>> Le mar. 14 juil. 2015 à 7:19, Shashidhar Rao <raoshashidhar123@gmail.com <ma...@gmail.com>> a écrit :
>> Hi,
>> 
>> I am doing my PHD thesis on large scale machine learning e.g  Online learning, batch and mini batch learning.
>> 
>> Could somebody help me with ideas especially in the context of Spark and to the above learning methods. 
>> 
>> Some ideas like improvement to existing algorithms, implementing new features especially the above learning methods and algorithms that have not been implemented etc.
>> 
>> If somebody could help me with some ideas it would really accelerate my work.
>> 
>> Plus few ideas on research papers regarding Spark or Mahout.
>> 
>> Thanks in advance.
>> 
>> Regards 
> 
> 


Re: Research ideas using spark

Posted by vaquar khan <va...@gmail.com>.
I would suggest studying Spark, Flink, and Storm, and preparing your
research paper based on your understanding and findings.

Maybe you will invent the new Spark ☺

Regards,
Vaquar khan
On 16 Jul 2015 00:47, "Michael Segel" <ms...@hotmail.com> wrote:

> Silly question…
>
> When thinking about a PhD thesis… do you want to tie it to a specific
> technology or do you want to investigate an idea but then use a specific
> technology.
> Or is this an outdated way of thinking?
>
> "I am doing my PHD thesis on large scale machine learning e.g  Online
> learning, batch and mini batch learning."
>
> So before we look at technologies like Spark… could the OP break down a
> more specific concept or idea that he wants to pursue?
>
> Looking at what Jorn said…
>
> Using machine learning to better predict workloads in terms of managing
> clusters… This could be interesting… but is it enough for a PhD thesis, or
> of interest to the OP?
>
>
> On Jul 15, 2015, at 9:43 AM, Jörn Franke <jo...@gmail.com> wrote:
>
> Well one of the strength of spark is standardized general distributed
> processing allowing many different types of processing, such as graph
> processing, stream processing etc. The limitation is that it is less
> performant than one system focusing only on one type of processing (eg
> graph processing). I miss - and this may not be spark specific - some
> artificial intelligence to manage a cluster, e.g. Predicting workloads, how
> long a job may run based on previously executed similar jobs etc.
> Furthermore, many optimizations you have do to manually, e.g. Bloom
> filters, partitioning etc - if you find here as well some intelligence that
> does this automatically based on previously executed jobs taking into
> account that optimizations themselves change over time would be great...
> You may also explore feature interaction
>
> Le mar. 14 juil. 2015 à 7:19, Shashidhar Rao <ra...@gmail.com>
> a écrit :
>
>> Hi,
>>
>> I am doing my PHD thesis on large scale machine learning e.g  Online
>> learning, batch and mini batch learning.
>>
>> Could somebody help me with ideas especially in the context of Spark and
>> to the above learning methods.
>>
>> Some ideas like improvement to existing algorithms, implementing new
>> features especially the above learning methods and algorithms that have not
>> been implemented etc.
>>
>> If somebody could help me with some ideas it would really accelerate my
>> work.
>>
>> Plus few ideas on research papers regarding Spark or Mahout.
>>
>> Thanks in advance.
>>
>> Regards
>>
>
>
>

Re: Research ideas using spark

Posted by Michael Segel <ms...@hotmail.com>.
Silly question…

When thinking about a PhD thesis… do you want to tie it to a specific technology, or do you want to investigate an idea and then use a specific technology?
Or is this an outdated way of thinking?

"I am doing my PHD thesis on large scale machine learning e.g  Online learning, batch and mini batch learning."

So before we look at technologies like Spark… could the OP break down a more specific concept or idea that he wants to pursue?

Looking at what Jörn said…

Using machine learning to better predict workloads in terms of managing clusters… This could be interesting… but is it enough for a PhD thesis, or of interest to the OP?


> On Jul 15, 2015, at 9:43 AM, Jörn Franke <jo...@gmail.com> wrote:
> 
> Well one of the strength of spark is standardized general distributed processing allowing many different types of processing, such as graph processing, stream processing etc. The limitation is that it is less performant than one system focusing only on one type of processing (eg graph processing). I miss - and this may not be spark specific - some artificial intelligence to manage a cluster, e.g. Predicting workloads, how long a job may run based on previously executed similar jobs etc. Furthermore, many optimizations you have do to manually, e.g. Bloom filters, partitioning etc - if you find here as well some intelligence that does this automatically based on previously executed jobs taking into account that optimizations themselves change over time would be great... You may also explore feature interaction
> 
> Le mar. 14 juil. 2015 à 7:19, Shashidhar Rao <raoshashidhar123@gmail.com <ma...@gmail.com>> a écrit :
> Hi,
> 
> I am doing my PHD thesis on large scale machine learning e.g  Online learning, batch and mini batch learning.
> 
> Could somebody help me with ideas especially in the context of Spark and to the above learning methods. 
> 
> Some ideas like improvement to existing algorithms, implementing new features especially the above learning methods and algorithms that have not been implemented etc.
> 
> If somebody could help me with some ideas it would really accelerate my work.
> 
> Plus few ideas on research papers regarding Spark or Mahout.
> 
> Thanks in advance.
> 
> Regards 



Re: Research ideas using spark

Posted by Jörn Franke <jo...@gmail.com>.
Well, one of the strengths of Spark is standardized, general distributed
processing that allows many different types of processing, such as graph
processing, stream processing, etc. The limitation is that it is less
performant than a system focusing on only one type of processing (e.g.
graph processing). I miss - and this may not be Spark-specific - some
artificial intelligence to manage a cluster, e.g. predicting workloads, or
how long a job may run based on previously executed similar jobs.
Furthermore, there are many optimizations you have to do manually, e.g.
Bloom filters, partitioning, etc. - if you could also find some
intelligence that does this automatically based on previously executed
jobs, taking into account that the optimizations themselves change over
time, that would be great... You may also explore feature interaction.
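
As a toy illustration of the runtime-prediction part (everything here -
the features, the numbers, and the choice of a plain linear model - is
invented for the example, not an existing Spark facility):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

// Fit a linear model on hand-made features of previously executed jobs
// (input size in GB, number of stages) and predict the runtime of a new job.
object RuntimePredictionSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("runtime-prediction").setMaster("local[*]"))

    // label = runtime in minutes, features = [input GB, number of stages]
    val history = sc.parallelize(Seq(
      LabeledPoint(12.0, Vectors.dense(10.0, 3.0)),
      LabeledPoint(25.0, Vectors.dense(22.0, 5.0)),
      LabeledPoint(40.0, Vectors.dense(35.0, 8.0))))

    // Small step size because the features are not scaled.
    val model = LinearRegressionWithSGD.train(history, 200, 0.001)

    // Predicted runtime for a new job with 18 GB of input and 4 stages.
    println(model.predict(Vectors.dense(18.0, 4.0)))
    sc.stop()
  }
}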

Le mar. 14 juil. 2015 à 7:19, Shashidhar Rao <ra...@gmail.com> a
écrit :

> Hi,
>
> I am doing my PHD thesis on large scale machine learning e.g  Online
> learning, batch and mini batch learning.
>
> Could somebody help me with ideas especially in the context of Spark and
> to the above learning methods.
>
> Some ideas like improvement to existing algorithms, implementing new
> features especially the above learning methods and algorithms that have not
> been implemented etc.
>
> If somebody could help me with some ideas it would really accelerate my
> work.
>
> Plus few ideas on research papers regarding Spark or Mahout.
>
> Thanks in advance.
>
> Regards
>