Posted to hdfs-user@hadoop.apache.org by Anshuman Mathur <an...@gmail.com> on 2013/05/21 15:35:55 UTC

Project ideas

Hello fellow users,

We are a group of students at the National University of Singapore. As
part of our course curriculum we need to develop an application using
Hadoop and MapReduce. Can you please suggest some innovative ideas for
our project?

Thanks in advance.

Anshuman

Re: Project ideas

Posted by maisnam ns <ma...@gmail.com>.
Here's a nice link to get started:

http://hadoopblog.blogspot.in/2010/11/hadoop-research-topics.html

Regards
Niranjan Singh


On Tue, May 21, 2013 at 10:20 PM, Rahul Bhattacharjee <
rahul.rec.dgp@gmail.com> wrote:

> You could either add a simple new feature to Hadoop or develop an
> application using Hadoop.
>
> Some time back another university student wanted to add encryption to
> HDFS. It's just a pointer.
>
> Here is a problem which might interest your university:
>
> Talk to the IT dept of NUS and collect as many server logs as you can with
> their help. Then write some different MR jobs to analyze those logs and
> show them some useful, interesting stats.
>
>
> Thanks,
> Rahul
>
>
> On Tue, May 21, 2013 at 7:16 PM, Michael Segel <mi...@hotmail.com> wrote:
>
>> Drink heavily?
>>
>> Sorry.
>>
>> Let me rephrase.
>>
>> Part of the exercise is for you, the student, to come up with the idea,
>> not to solicit someone else for a suggestion. This is how you learn.
>>
>> The exercise is to get you to think about the following:
>>
>> 1) What is Hadoop
>> 2) How does it work
>> 3) Why would you want to use it
>>
>> You need to understand #1 and #2 to be able to answer #3.
>>
>> But at the same time... you need to also incorporate your own view of the
>> world.
>> What are your hobbies? What do you like to do?
>> What scares you the most?  What excites you the most?
>> Why are you here?
>> And most importantly, what do you think you can do within the time
>> period.
>> (What data can you easily capture and work with...)
>>
>> Have you ever seen 'Eden of the East' ? ;-)
>>
>> HTH
>>
>>
>>  On May 21, 2013, at 8:35 AM, Anshuman Mathur <an...@gmail.com> wrote:
>>
>> Hello fellow users,
>>
>> We are a group of students studying in National University of Singapore.
>> As part of our course curriculum we need to develop an application using
>> Hadoop and  map-reduce. Can you please suggest some innovative ideas for
>> our project?
>>
>> Thanks in advance.
>>
>> Anshuman
>>
>>
>>
>

Re: Project ideas

Posted by Rahul Bhattacharjee <ra...@gmail.com>.
You could either add a simple new feature to Hadoop or develop an
application using Hadoop.

Some time back another university student wanted to add encryption to
HDFS. It's just a pointer.

Here is a problem which might interest your university:

Talk to the IT dept of NUS and collect as many server logs as you can with
their help. Then write some different MR jobs to analyze those logs and
show them some useful, interesting stats.
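As a rough sketch of what such an MR job might look like, here is a Hadoop Streaming style mapper and reducer that count hits per client IP. The log format is hypothetical (the client IP is assumed to be the first whitespace-separated field); adjust the field positions for your actual logs.

```python
import sys
from itertools import groupby

def mapper(lines):
    # Emit "ip\t1" for each log line; assumes the client IP is the first field.
    for line in lines:
        fields = line.split()
        if fields:
            yield fields[0] + "\t1"

def reducer(lines):
    # Input arrives sorted by key (Hadoop's shuffle guarantees this);
    # sum the counts for each IP.
    parsed = (line.split("\t") for line in lines)
    for ip, group in groupby(parsed, key=lambda kv: kv[0]):
        yield "%s\t%d" % (ip, sum(int(kv[1]) for kv in group))

if __name__ == "__main__":
    # Run the same script as both phases, e.g. with hadoop-streaming:
    #   -mapper "log_stats.py map" -reducer "log_stats.py reduce"
    phase = sys.argv[1] if len(sys.argv) > 1 else "map"
    fn = mapper if phase == "map" else reducer
    for out in fn(line.rstrip("\n") for line in sys.stdin):
        print(out)
```

The same pair of functions can be tested locally on a handful of lines before submitting the job to the cluster.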


Thanks,
Rahul


On Tue, May 21, 2013 at 7:16 PM, Michael Segel <mi...@hotmail.com> wrote:

> Drink heavily?
>
> Sorry.
>
> Let me rephrase.
>
> Part of the exercise is for you, the student, to come up with the idea, not
> to solicit someone else for a suggestion. This is how you learn.
>
> The exercise is to get you to think about the following:
>
> 1) What is Hadoop
> 2) How does it work
> 3) Why would you want to use it
>
> You need to understand #1 and #2 to be able to answer #3.
>
> But at the same time... you need to also incorporate your own view of the
> world.
> What are your hobbies? What do you like to do?
> What scares you the most?  What excites you the most?
> Why are you here?
> And most importantly, what do you think you can do within the time period.
> (What data can you easily capture and work with...)
>
> Have you ever seen 'Eden of the East' ? ;-)
>
> HTH
>
>
> On May 21, 2013, at 8:35 AM, Anshuman Mathur <an...@gmail.com> wrote:
>
> Hello fellow users,
>
> We are a group of students studying in National University of Singapore.
> As part of our course curriculum we need to develop an application using
> Hadoop and  map-reduce. Can you please suggest some innovative ideas for
> our project?
>
> Thanks in advance.
>
> Anshuman
>
>
>

Re: Project ideas

Posted by Juan Suero <ju...@gmail.com>.
I'm a newbie, but maybe this will also add some value.
It is my understanding that MapReduce is like a distributed "group by"
statement.
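That "distributed group by" intuition can be made concrete with a toy in-memory sketch of the map, shuffle, and reduce phases (plain Python, no Hadoop required; the word-count example is illustrative only):

```python
from collections import defaultdict

def map_phase(records):
    # Emit (key, value) pairs, like a mapper: here, (word, 1) per word.
    for line in records:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    # Group values by key, as the framework's shuffle/sort does.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Aggregate each group, like "SELECT word, COUNT(*) ... GROUP BY word".
    return {key: sum(values) for key, values in groups.items()}

lines = ["hadoop is fun", "hadoop is distributed"]
print(reduce_phase(shuffle(map_phase(lines))))
# {'hadoop': 2, 'is': 2, 'fun': 1, 'distributed': 1}
```

In real Hadoop the map and reduce functions run on different machines and the shuffle moves data over the network, but the data flow is the same.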

When you run a statement like this against petabytes of data, it can take a
long time, first and foremost because before you can apply the group-by
logic you have to read the data off disk.

If your disk reads at 100 MB/s, you can do the math:
the query will take at least that long to complete.

If you need this info really fast (say, in the next hour, to support
personalization features on an e-commerce site, or a month-end report that
needs to be complete in 2 hours), then it would be nice to put equal parts
of your data on hundreds of disks and run the same algorithm in parallel.
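The back-of-the-envelope math above is easy to check, assuming a sustained 100 MB/s per disk and a perfectly even split of the data (both idealizations):

```python
def scan_hours(data_tb, disks, mb_per_sec=100.0):
    """Hours to read data_tb terabytes once, split evenly across `disks` disks."""
    total_mb = data_tb * 1024 * 1024           # TB -> MB
    seconds = total_mb / (disks * mb_per_sec)  # disks read in parallel
    return seconds / 3600.0

# A 10 TB dataset on a single disk vs. spread across 100 disks:
print(round(scan_hours(10, 1), 1))    # 29.1 hours on one disk
print(round(scan_hours(10, 100), 2))  # 0.29 hours across 100 disks
```

This is exactly the speedup Hadoop buys you when the bottleneck is disk throughput.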

But that's only if your bottleneck is disk. What if your dataset is
relatively small, but the calculation done on each incoming element is
large? Then your bottleneck is CPU power.

There are a lot of bottlenecks you could run into:
number of threads,
memory,
latency of a remote API or remote database you hit as you analyze the data.

There's a book called Programming Collective Intelligence from O'Reilly
that should help you out too:
http://shop.oreilly.com/product/9780596529321.do



On Tue, May 21, 2013 at 11:02 PM, Sai Sai <sa...@yahoo.in> wrote:

> Excellent input, Sanjay, many thanks.
> I have always been thinking about some ideas but never knew which one to
> proceed with.
> Thanks again.
> Sai
>
>   ------------------------------
>  *From:* Sanjay Subramanian <Sa...@wizecommerce.com>
> *To:* "user@hadoop.apache.org" <us...@hadoop.apache.org>
> *Sent:* Tuesday, 21 May 2013 11:51 PM
> *Subject:* Re: Project ideas
>
>  +1
>
>  My $0.02 is to look around and see problems u can solve…It's better to
> get a list of problems and see if u can model a solution using the
> map-reduce framework.
>
>  An example is as follows
>
>  PROBLEM
> Build a Cars Pricing Model based on advertisements on Craigs list
>
>  OBJECTIVE
> Recommend a price to the Craigslist car seller when the user gives info
> about make,model,color,miles
>
>  DATA required
> Collect RSS feeds daily from Craigslist (don't pound their website, else
> they will lock u down)
>
>  DESIGN COMPONENTS
> - Daily RSS Collector - pulls data and puts into HDFS
> - Data Loader - Structures the columns u need to analyze and puts into HDFS
> - Hive Aggregator and analyzer - studies and queries data and brings out
> recommendation models for car pricing
> - REST Web service to return query results in XML/JSON
> - iPhone App that talks to web service and gets info
>
>  There u go…this should keep a couple of students busy for 3 months
>
>  I find this kind of problem statement and solution simpler to
> understand because it's all there in the real world!
>
>  This way of thinking led me to found a non-profit
> called www.medicalsidefx.org that gives users valuable metrics regarding
> medical side effects.
> It uses Hadoop to aggregate and Lucene to search….This year I am
> redesigning the core to use Hive :-)
>
>  Good luck
>
>  Sanjay
>
>
>
>
>
>   From: Michael Segel <mi...@hotmail.com>
> Reply-To: "user@hadoop.apache.org" <us...@hadoop.apache.org>
> Date: Tuesday, May 21, 2013 6:46 AM
> To: "user@hadoop.apache.org" <us...@hadoop.apache.org>
> Subject: Re: Project ideas
>
>  Drink heavily?
>
>  Sorry.
>
>  Let me rephrase.
>
>  Part of the exercise is for you, the student, to come up with the idea,
> not to solicit someone else for a suggestion. This is how you learn.
>
>  The exercise is to get you to think about the following:
>
>  1) What is Hadoop
> 2) How does it work
> 3) Why would you want to use it
>
>  You need to understand #1 and #2 to be able to answer #3.
>
>  But at the same time... you need to also incorporate your own view of
> the world.
> What are your hobbies? What do you like to do?
> What scares you the most?  What excites you the most?
> Why are you here?
> And most importantly, what do you think you can do within the time period.
> (What data can you easily capture and work with...)
>
>  Have you ever seen 'Eden of the East' ? ;-)
>
>  HTH
>
>
>  On May 21, 2013, at 8:35 AM, Anshuman Mathur <an...@gmail.com> wrote:
>
>  Hello fellow users,
> We are a group of students studying in National University of Singapore.
> As part of our course curriculum we need to develop an application using
> Hadoop and  map-reduce. Can you please suggest some innovative ideas for
> our project?
> Thanks in advance.
> Anshuman
>
>
>
>
>
>
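The aggregation step of Sanjay's quoted car-pricing pipeline (his "Hive Aggregator and analyzer") could be prototyped offline before involving Hadoop at all. A minimal sketch, assuming records parsed from the RSS feeds into dicts; the field names `make`, `model`, and `price` are made up for illustration:

```python
from collections import defaultdict

def recommend_prices(ads):
    """Average asking price per (make, model) from collected ads.
    Equivalent to: SELECT make, model, AVG(price) ... GROUP BY make, model."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [sum of prices, count]
    for ad in ads:
        key = (ad["make"], ad["model"])
        totals[key][0] += ad["price"]
        totals[key][1] += 1
    return {key: total / n for key, (total, n) in totals.items()}

ads = [
    {"make": "Honda", "model": "Civic", "price": 8000},
    {"make": "Honda", "model": "Civic", "price": 10000},
    {"make": "Ford", "model": "Focus", "price": 6000},
]
print(recommend_prices(ads))
# {('Honda', 'Civic'): 9000.0, ('Ford', 'Focus'): 6000.0}
```

A real model would condition on miles and color too, but the grouped-average skeleton is the piece that maps directly onto a Hive query once the data lives in HDFS.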

Re: Install hadoop on multiple VMs in 1 laptop like a cluster

Posted by Shashidhar Rao <ra...@gmail.com>.
Hi,

Even I have been wanting to know this. I have Oracle VM VirtualBox on a
Windows 7 laptop, and inside it only one Ubuntu instance is running.
How do I add multiple virtual machines, as Sai Sai has mentioned?

Thanks
Shashidhar


On Fri, May 31, 2013 at 5:23 PM, Sai Sai <sa...@yahoo.in> wrote:

> Just wondering if anyone has any documentation or references to any
> articles on how to simulate a multi-node cluster setup on 1 laptop with
> Hadoop running on multiple Ubuntu VMs. Any help is appreciated.
> Thanks
> Sai
>

Re: Install hadoop on multiple VMs in 1 laptop like a cluster

Posted by Jay Vyas <ja...@gmail.com>.
Just FYI, if you are on Linux, KVM and kickstart are really good for this
as well, and we have some kickstart Fedora 16 Hadoop setup scripts I can
share to spin up a cluster of several VMs on the fly with static IPs.
(That, to me, is usually the tricky part of a Hadoop VM cluster setup:
setting up the VMs with static IP addresses, getting the nodes to talk /
ssh to each other, and consistently defining the slaves file.)

But if you are stuck with VMware, then I believe VMware also has a Vagrant
plugin now, which will be much easier for you to maintain.

Manually cloning machines doesn't scale well when you want to rebuild your
cluster.
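The fiddly parts Jay mentions (consistent static IPs, /etc/hosts entries, and the slaves file) can be generated from one place instead of typed by hand on every VM. A minimal sketch; the 192.168.56.x host-only subnet and the master/slaveN hostnames are illustrative choices, not requirements:

```python
def cluster_config(base_ip="192.168.56", start=10, workers=3):
    """Generate /etc/hosts entries and a Hadoop slaves file for one master
    plus `workers` worker VMs on a host-only subnet."""
    nodes = [("master", "%s.%d" % (base_ip, start))]
    nodes += [("slave%d" % i, "%s.%d" % (base_ip, start + i))
              for i in range(1, workers + 1)]
    # /etc/hosts fragment: one "ip<TAB>hostname" line per node.
    hosts = "\n".join("%s\t%s" % (ip, name) for name, ip in nodes)
    # conf/slaves: worker hostnames only, one per line.
    slaves = "\n".join(name for name, _ in nodes[1:])
    return hosts, slaves

hosts, slaves = cluster_config(workers=2)
print(hosts)
print(slaves)
```

Append the hosts fragment to /etc/hosts on every VM (and the host machine) and drop the slaves list into the Hadoop conf directory on the master, so all nodes agree on names and addresses.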



On Fri, May 31, 2013 at 10:56 AM, Jean-Marc Spaggiari <
jean-marc@spaggiari.org> wrote:

> Hi Sai Sai,
>
> You can take a look at that also: http://goo.gl/iXzae
>
> I just did that yesterday for some other folks I'm working with. Maybe
> not the best way, but working like a charm.
>
> JM
>
> 2013/5/31 shashwat shriparv <dw...@gmail.com>:
> > Try this
> > http://www.youtube.com/watch?v=gIRubPl20oo
> > there will be three videos 1-3 watch and you can do what you need to
> do....
> >
> >
> >
> > Thanks & Regards
> >
> > ∞
> >
> > Shashwat Shriparv
> >
> >
> >
> > On Fri, May 31, 2013 at 5:52 PM, Jitendra Yadav <
> jeetuyadav200890@gmail.com>
> > wrote:
> >>
> >> Hi,
> >>
> >> You can create a clone machine through an existing virtual machine in
> >> VMware and then run it as a separate virtual machine.
> >>
> >> http://www.vmware.com/support/ws55/doc/ws_clone_new_wizard.html
> >>
> >>
> >> After installing you have to make sure that all the virtual machines are
> >> setup with correct network set up so that they can ping each other (you
> >> should use Host only network settings in network configuration).
> >>
> >> I hope this will help you.
> >>
> >>
> >> Regards
> >> Jitendra
> >>
> >> On Fri, May 31, 2013 at 5:23 PM, Sai Sai <sa...@yahoo.in> wrote:
> >>>
> >>> Just wondering if anyone has any documentation or references to any
> >>> articles how to simulate a multi node cluster setup in 1 laptop with
> hadoop
> >>> running on multiple ubuntu VMs. any help is appreciated.
> >>> Thanks
> >>> Sai
> >>
> >>
> >
>



-- 
Jay Vyas
http://jayunit100.blogspot.com

Re: Install hadoop on multiple VMs in 1 laptop like a cluster

Posted by Jean-Marc Spaggiari <je...@spaggiari.org>.
Hi Sai Sai,

You can take a look at this also: http://goo.gl/iXzae

I just did that yesterday for some other folks I'm working with. Maybe
not the best way, but it works like a charm.

JM

2013/5/31 shashwat shriparv <dw...@gmail.com>:
> Try this:
> http://www.youtube.com/watch?v=gIRubPl20oo
> There will be three videos (1-3); watch them and you can do what you need
> to do.
>
>
>
> Thanks & Regards
>
> ∞
>
> Shashwat Shriparv
>
>
>
> On Fri, May 31, 2013 at 5:52 PM, Jitendra Yadav <je...@gmail.com>
> wrote:
>>
>> Hi,
>>
>> You can create a clone machine through an existing virtual machine in
>> VMware and then run it as a separate virtual machine.
>>
>> http://www.vmware.com/support/ws55/doc/ws_clone_new_wizard.html
>>
>>
>> After installing, you have to make sure that all the virtual machines
>> are set up with the correct network configuration so that they can ping
>> each other (you should use host-only network settings).
>>
>> I hope this will help you.
>>
>>
>> Regards
>> Jitendra
>>
>> On Fri, May 31, 2013 at 5:23 PM, Sai Sai <sa...@yahoo.in> wrote:
>>>
>>> Just wondering if anyone has any documentation or references to any
>>> articles how to simulate a multi node cluster setup in 1 laptop with hadoop
>>> running on multiple ubuntu VMs. any help is appreciated.
>>> Thanks
>>> Sai
>>
>>
>

Re: Install hadoop on multiple VMs in 1 laptop like a cluster

Posted by shashwat shriparv <dw...@gmail.com>.
Try this:
http://www.youtube.com/watch?v=gIRubPl20oo
There are three videos (1-3); watch them and you will be able to do what you need.



Thanks & Regards

∞
Shashwat Shriparv



On Fri, May 31, 2013 at 5:52 PM, Jitendra Yadav
<je...@gmail.com>wrote:

> Hi,
>
> You can create a clone machine through an existing virtual machine
> in VMware and then run it as a separate virtual machine.
>
> http://www.vmware.com/support/ws55/doc/ws_clone_new_wizard.html
>
>
> After installing you have to make sure that all the virtual machines are
> setup with correct network set up so that they can ping each other (you
> should use Host only network settings in network configuration).
>
> I hope this will help you.
>
>
> Regards
> Jitendra
>
> On Fri, May 31, 2013 at 5:23 PM, Sai Sai <sa...@yahoo.in> wrote:
>
>>  Just wondering if anyone has any documentation or references to any
>> articles how to simulate a multi node cluster setup in 1 laptop with hadoop
>> running on multiple ubuntu VMs. any help is appreciated.
>> Thanks
>> Sai
>>
>
>

Re: Install hadoop on multiple VMs in 1 laptop like a cluster

Posted by Jitendra Yadav <je...@gmail.com>.
Hi,

You can clone an existing virtual machine in VMware and then run the
clone as a separate virtual machine.

http://www.vmware.com/support/ws55/doc/ws_clone_new_wizard.html


After cloning, make sure that all the virtual machines are set up with the
correct network configuration so that they can ping each other (you should
use host-only networking in the network settings).
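
For concreteness, here is a minimal sketch of the pieces involved, assuming
three cloned Ubuntu VMs on a host-only network; the hostnames, addresses,
and Hadoop 1.x layout below are illustrative assumptions, not a definitive
recipe:

```
# /etc/hosts on every VM (and on the host, if you want web UI access)
192.168.56.101  hadoop-master
192.168.56.102  hadoop-slave1
192.168.56.103  hadoop-slave2

# On hadoop-master (Hadoop 1.x layout):
#   conf/slaves  ->  hadoop-slave1
#                    hadoop-slave2
# core-site.xml on all three machines should point fs.default.name at
# hdfs://hadoop-master:9000
```

Verify connectivity with ping hadoop-slave1 from the master, and set up
passwordless ssh between the machines before running start-dfs.sh.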

I hope this will help you.


Regards
Jitendra

On Fri, May 31, 2013 at 5:23 PM, Sai Sai <sa...@yahoo.in> wrote:

>  Just wondering if anyone has any documentation or references to any
> articles how to simulate a multi node cluster setup in 1 laptop with hadoop
> running on multiple ubuntu VMs. any help is appreciated.
> Thanks
> Sai
>

Re: Install hadoop on multiple VMs in 1 laptop like a cluster

Posted by Shashidhar Rao <ra...@gmail.com>.
Hi,

I have been wanting to know this as well. I have Oracle VM VirtualBox on a
Windows 7 laptop, and inside it only one Ubuntu instance is running. How do
I add multiple virtual machines, as Sai Sai mentioned?

Thanks
Shashidhar


On Fri, May 31, 2013 at 5:23 PM, Sai Sai <sa...@yahoo.in> wrote:

> Just wondering if anyone has any documentation or references to any
> articles how to simulate a multi node cluster setup in 1 laptop with hadoop
> running on multiple ubuntu VMs. any help is appreciated.
> Thanks
> Sai
>

Re: Install hadoop on multiple VMs in 1 laptop like a cluster

Posted by Sai Sai <sa...@yahoo.in>.
Just wondering if anyone has any documentation or references to any articles on how to simulate a multi-node cluster setup in one laptop, with Hadoop running on multiple Ubuntu VMs. Any help is appreciated.
Thanks
Sai

Re: Project ideas

Posted by Juan Suero <ju...@gmail.com>.
I'm a newbie, but maybe this will also add some value.
It is my understanding that MapReduce is like a distributed "group by"
statement.

When you run a statement like this against petabytes of data it can take a
long time, first and foremost because before you apply the group-by logic
you have to read the data off disk.

If your disk reads at 100 MB/s then you can do the math:
the query will take at least that long to complete.

If you need the result really fast, say in the next hour to support
personalization features on an e-commerce site, or for a month-end report
that needs to be complete in two hours, then it would be nice to put equal
parts of your data on hundreds of disks and run the same algorithm in
parallel.
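
To make the back-of-the-envelope math above concrete, here is a quick
sketch (plain Java; all numbers are illustrative):

```java
// Back-of-envelope scan times: reading 1 PB sequentially at 100 MB/s,
// on a single disk versus 1,000 disks in parallel.
public class ScanTime {
    static double singleDiskDays(double dataMB, double mbPerSec) {
        return dataMB / mbPerSec / 86_400;        // 86,400 seconds per day
    }

    static double parallelMinutes(double dataMB, double mbPerSec, int disks) {
        return dataMB / (mbPerSec * disks) / 60;  // 60 seconds per minute
    }

    public static void main(String[] args) {
        double petabyteMB = 1e9;                  // 1 PB expressed in MB
        System.out.printf("1 disk: ~%.0f days%n",
                singleDiskDays(petabyteMB, 100));        // ~116 days
        System.out.printf("1000 disks: ~%.0f minutes%n",
                parallelMinutes(petabyteMB, 100, 1000)); // ~167 minutes
    }
}
```

So a query that takes months on one disk finishes in under three hours when
the read is spread over a thousand disks.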

But that's only if your bottleneck is disk. What if your dataset is
relatively small, but the calculation done on each incoming element is
large? Then your bottleneck is CPU power.

There are a lot of bottlenecks you could run into:
- number of threads
- memory
- latency of remote APIs or a remote database you hit as you analyze the data

There's a book called Programming Collective Intelligence from O'Reilly
that should help you out too:
http://shop.oreilly.com/product/9780596529321.do



On Tue, May 21, 2013 at 11:02 PM, Sai Sai <sa...@yahoo.in> wrote:

> Excellent Sanjay, really excellent input. Many Thanks for this input.
> I have been always thinking about some ideas but never knowing what to
> proceed with.
> Thanks again.
> Sai
>
>   ------------------------------
>  *From:* Sanjay Subramanian <Sa...@wizecommerce.com>
> *To:* "user@hadoop.apache.org" <us...@hadoop.apache.org>
> *Sent:* Tuesday, 21 May 2013 11:51 PM
> *Subject:* Re: Project ideas
>
>  +1
>
>  My $0.02 is look look around and see problems u can solve…Its better to
> get a list of problems and see if u can model a solution using map-reduce
> framework
>
>  An example is as follows
>
>  PROBLEM
> Build a Cars Pricing Model based on advertisements on Craigs list
>
>  OBJECTIVE
> Recommend a price to the Craigslist car seller when the user gives info
> about make,model,color,miles
>
>  DATA required
> Collect RSS feeds daily from Craigs List (don't pound their website , else
> they will lock u down)
>
>  DESIGN COMPONENTS
> - Daily RSS Collector - pulls data and puts into HDFS
> - Data Loader - Structures the columns u need to analyze and puts into HDFS
> - Hive Aggregator and analyzer - studies and queries data and brings out
> recommendation models for car pricing
> - REST Web service to return query results in XML/JSON
> - iPhone App that talks to web service and gets info
>
>  There u go…this should keep a couple of students busy for 3 months
>
>  I find this kind of problem statement and solutions simpler to
> understand because its all there in the real world !
>
>  An example of my way of thinking led to me founding this non profit
> called www.medicalsidefx.org that gives users valuable metrics regarding
> medical side fx.
> It uses Hadoop to aggregate , Lucene to search….This year I am redesigning
> the core to use Hive :-)
>
>  Good luck
>
>  Sanjay
>
>
>
>
>
>   From: Michael Segel <mi...@hotmail.com>
> Reply-To: "user@hadoop.apache.org" <us...@hadoop.apache.org>
> Date: Tuesday, May 21, 2013 6:46 AM
> To: "user@hadoop.apache.org" <us...@hadoop.apache.org>
> Subject: Re: Project ideas
>
>  Drink heavily?
>
>  Sorry.
>
>  Let me rephrase.
>
>  Part of the exercise is for you, the student to come up with the idea.
> Not solicit someone else for a suggestion.  This is how you learn.
>
>  The exercise is to get you to think about the following:
>
>  1) What is Hadoop
> 2) How does it work
> 3) Why would you want to use it
>
>  You need to understand #1 and #2 to be able to #3.
>
>  But at the same time... you need to also incorporate your own view of
> the world.
> What are your hobbies? What do you like to do?
> What scares you the most?  What excites you the most?
> Why are you here?
> And most importantly, what do you think you can do within the time period.
> (What data can you easily capture and work with...)
>
>  Have you ever seen 'Eden of the East' ? ;-)
>
>  HTH
>
>
>  On May 21, 2013, at 8:35 AM, Anshuman Mathur <an...@gmail.com> wrote:
>
>  Hello fellow users,
> We are a group of students studying in National University of Singapore.
> As part of our course curriculum we need to develop an application using
> Hadoop and  map-reduce. Can you please suggest some innovative ideas for
> our project?
> Thanks in advance.
> Anshuman
>
>
>
>
>
>

Re: Why/When partitioner is used.

Posted by Bryan Beaudreault <bb...@hubspot.com>.
There are practical applications for defining your own partitioner as well:

1) Controlling database concurrency. For instance, let's say you have a
distributed datastore like HBase, or even your own MySQL sharding scheme.
Using the default HashPartitioner, keys will be more or less randomly
distributed across your reducers. If your reduce code does database saves
or gets, this could cause periods where all reducers are hitting a single
database. That may be more concurrency than your database can handle, so
you could use a partitioner to send all keys you know would hit Shard A to
reducers 1, 2, and 3, and all keys that would hit Shard B to reducers 4, 5,
and 6.

2) I've also used partitioners when I want to do cross-key operations such
as deduping or counting. You can further combine a custom partitioner with
your own sort comparator and grouping comparator to do many advanced
operations, depending on the application you are working on.

Since a single Reducer instance is used to reduce() all tuples in a
partition, being able to control exactly which records make it onto a
partition is a hugely valuable tool.
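
To illustrate point 1, here is the shard-routing idea written as plain Java
so it stands alone; in a real job this logic would live in the
getPartition() method of a Partitioner subclass, and the "a:" key prefix is
a made-up convention:

```java
// Route keys for Shard A to the first half of the reducers and all other
// keys to the second half, bounding how many reducers hit each database.
public class ShardRouting {
    static int partitionFor(String key, int numPartitions) {
        if (numPartitions < 2) {
            return 0;                                 // nothing to split
        }
        int half = numPartitions / 2;
        int h = key.hashCode() & Integer.MAX_VALUE;   // force non-negative
        if (key.startsWith("a:")) {
            return h % half;                          // reducers 0 .. half-1
        }
        return half + h % (numPartitions - half);     // reducers half .. n-1
    }

    public static void main(String[] args) {
        System.out.println(partitionFor("a:order-42", 6)); // always in 0..2
        System.out.println(partitionFor("b:order-42", 6)); // always in 3..5
    }
}
```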


On Fri, Jun 7, 2013 at 10:03 AM, John Lilley <jo...@redpoint.net> wrote:

>  There are kind of two parts to this.  The semantics of MapReduce promise
> that all tuples sharing the same key value are sent to the same reducer, so
> that you can write useful MR applications that do things like “count words”
> or “summarize by date”.  In order to accomplish that, the shuffle phase of
> MR performs a partitioning by key to move tuples sharing the same key to
> the same node where they can be processed together.  You can think of
> key-partitioning as a strategy that assists in parallel distributed sorting.
>
> john
>
> From: Sai Sai [mailto:saigraph@yahoo.in]
> Sent: Friday, June 07, 2013 5:17 AM
> To: user@hadoop.apache.org
> Subject: Re: Why/When partitioner is used.
>
> I always get confused why we should partition and what is the use of it.
> Why would one want to send all the keys starting with A to Reducer1 and B
> to R2 and so on...
> Is it just to parallelize the reduce process.
> Please help.
> Thanks
> Sai

RE: Why/When partitioner is used.

Posted by John Lilley <jo...@redpoint.net>.
There are kind of two parts to this.  The semantics of MapReduce promise that all tuples sharing the same key value are sent to the same reducer, so that you can write useful MR applications that do things like “count words” or “summarize by date”.  In order to accomplish that, the shuffle phase of MR performs a partitioning by key to move tuples sharing the same key to the same node where they can be processed together.  You can think of key-partitioning as a strategy that assists in parallel distributed sorting.
john
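
The "same key, same reducer" guarantee John describes comes from hash
partitioning; a minimal stand-alone sketch mirroring what Hadoop's default
HashPartitioner computes:

```java
public class DefaultPartition {
    // (hash & Integer.MAX_VALUE) % numReducers: masking the sign bit keeps
    // the value non-negative, and the modulo spreads keys across reducers.
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        // Equal keys always map to the same reducer index.
        System.out.println(partition("hadoop", 4) == partition("hadoop", 4)); // true
    }
}
```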

From: Sai Sai [mailto:saigraph@yahoo.in]
Sent: Friday, June 07, 2013 5:17 AM
To: user@hadoop.apache.org
Subject: Re: Why/When partitioner is used.

I always get confused why we should partition and what is the use of it.
Why would one want to send all the keys starting with A to Reducer1 and B to R2 and so on...
Is it just to parallelize the reduce process.
Please help.
Thanks
Sai

Re: Is it possible to define num of mappers to run for a job

Posted by Sai Sai <sa...@yahoo.in>.
Is it possible to define the number of mappers to run for a job?

What are the conditions we need to be aware of when defining such a thing?
Please help.
Thanks
Sai

Re: Pool & slot questions

Posted by Patai Sangbutsarakum <Pa...@turn.com>.
Totally agree with Shahab,

just a quick answer, but the details are your homework:

> Can we think of a job pool similar to a queue.
I do think so; both partition the slot resources into chunks of different sizes.
With Fair Scheduler pools, scheduling inside a pool can be either FIFO or fair.
With queues, it's FIFO.
A cool thing about queues in YARN is sub-queues; check it out...


>  Is it possible to configure a slot if so how.
http://lmgtfy.com/?q=fair+scheduler+hadoop+tutorial


Good luck

On Jun 7, 2013, at 6:10 AM, Shahab Yunus <sh...@gmail.com>>
 wrote:

Sai,

This is regarding all your recent emails and questions. I suggest that you read Hadoop: The Definitive Guide by Tom White (http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449311520) as it goes through all of your queries in detail and with examples. The questions that you are asking are pretty basic and the answers are available and well documented all over the web. In parallel you can also download the code which is free and easily available and start looking into them.

Regards,
Shahab


On Fri, Jun 7, 2013 at 8:02 AM, Sai Sai <sa...@yahoo.in>> wrote:
1. Can we think of a job pool similar to a queue.

2. Is it possible to configure a slot if so how.

Please help.
Thanks
Sai
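For reference, with the MRv1 Fair Scheduler the pools Patai mentions are declared in an allocation file (pointed to by mapred.fairscheduler.allocation.file), and slot capacity is carved up per pool. A minimal sketch; the pool names and slot counts here are made up for illustration:

```xml
<?xml version="1.0"?>
<!-- Minimal Fair Scheduler allocation file: two pools partitioning
     the cluster's map/reduce slots. Pool names are illustrative. -->
<allocations>
  <pool name="research">
    <minMaps>10</minMaps>
    <minReduces>5</minReduces>
    <weight>2.0</weight>
  </pool>
  <pool name="adhoc">
    <minMaps>5</minMaps>
    <minReduces>2</minReduces>
  </pool>
</allocations>
```

Individual slots are not configured one by one; you set the number of map/reduce slots per TaskTracker (mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum) and the scheduler divides them among pools.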







Re: Pool & slot questions

Posted by Shahab Yunus <sh...@gmail.com>.
Sai,

This is regarding all your recent emails and questions. I suggest that you
read Hadoop: The Definitive Guide by Tom White (
http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449311520) as
it goes through all of your queries in detail and with examples. The
questions that you are asking are pretty basic, and the answers are
available and well documented all over the web. In parallel, you can also
download the code, which is free and easily available, and start looking into
it.

Regards,
Shahab


On Fri, Jun 7, 2013 at 8:02 AM, Sai Sai <sa...@yahoo.in> wrote:

> 1. Can we think of a job pool similar to a queue.
>
> 2. Is it possible to configure a slot if so how.
>
> Please help.
> Thanks
> Sai
>
>
>
>
>

Re: Pool & slot questions

Posted by Sai Sai <sa...@yahoo.in>.
1. Can we think of a job pool as similar to a queue?

2. Is it possible to configure a slot? If so, how?

Please help.
Thanks

Sai

Re: Is counter a static var

Posted by Sai Sai <sa...@yahoo.in>.
Is a counter like a static variable? If so, is it persisted on the NameNode or a DataNode?
Any input please.

Thanks
Sai
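A short answer: a counter is not a static variable, and there is no shared in-memory state between tasks. Each task keeps its own local tallies, reports them to the framework along with its status updates, and the framework (the JobTracker in MRv1) aggregates them into job-level totals; they end up in the job's history and bookkeeping, not as data on a NameNode or DataNode. A self-contained sketch of that aggregation behavior (all names illustrative, no Hadoop types):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of how MapReduce counters behave: each task accumulates its
// own local tallies (no shared static state), and the framework sums
// the per-task tallies into job-level totals.
public class CounterSketch {

    @SafeVarargs
    public static Map<String, Long> aggregate(Map<String, Long>... perTask) {
        Map<String, Long> total = new HashMap<>();
        for (Map<String, Long> task : perTask) {
            for (Map.Entry<String, Long> e : task.entrySet()) {
                // Sum each counter name across all tasks.
                total.merge(e.getKey(), e.getValue(), Long::sum);
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Two map tasks each counted some bad records locally.
        Map<String, Long> task1 = new HashMap<>();
        task1.put("BAD_RECORDS", 3L);
        Map<String, Long> task2 = new HashMap<>();
        task2.put("BAD_RECORDS", 2L);
        System.out.println(aggregate(task1, task2)); // {BAD_RECORDS=5}
    }
}
```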

Re: How hadoop processes image or video files

Posted by Sai Sai <sa...@yahoo.in>.
How are image or video files processed using Hadoop?
I understand that the byte[] is read by Hadoop using SequenceFileInputFormat in the map phase, but what is done after that
with this byte[], as it is something that does not make sense in its raw form?
Any input please.

Thanks
Sai
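One common pattern (not the only one): because HDFS handles many small files poorly, the images are packed into a SequenceFile of <filename, bytes> pairs, and each map() call decodes the raw byte[] with an ordinary library before extracting whatever features the job needs. A self-contained sketch of just the decode step, using the JDK's javax.imageio (no Hadoop types; the names are illustrative):

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import javax.imageio.ImageIO;

// Sketch: the raw byte[] a mapper receives only becomes meaningful
// once decoded. Here the bytes are turned into pixels; a real job
// would then emit features (dimensions, histogram buckets, ...) as
// intermediate key/value pairs.
public class ImageBytesSketch {

    // What a map() body might do with the value's bytes.
    public static int[] dimensions(byte[] raw) {
        try {
            BufferedImage img = ImageIO.read(new ByteArrayInputStream(raw));
            return new int[] { img.getWidth(), img.getHeight() };
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    // Stand-in for bytes read out of a SequenceFile value.
    public static byte[] samplePng(int width, int height) {
        try {
            BufferedImage img =
                new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            ImageIO.write(img, "png", out);
            return out.toByteArray();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        int[] wh = dimensions(samplePng(8, 4));
        System.out.println(wh[0] + "x" + wh[1]); // 8x4
    }
}
```

For video, the same idea applies, but frames are usually extracted with an external decoder before or during the job.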

Re: Why/When partitioner is used.

Posted by Harsh J <ha...@cloudera.com>.
Why not also ask yourself, what if you do not send all keys to the
same reducer? Would you get the results you desire that way? :)

On Fri, Jun 7, 2013 at 4:47 PM, Sai Sai <sa...@yahoo.in> wrote:
> I always get confused why we should partition and what is the use of it.
> Why would one want to send all the keys starting with A to Reducer1 and B to
> R2 and so on...
> Is it just to parallelize the reduce process.
> Please help.
> Thanks
> Sai



-- 
Harsh J

Re: Why/When partitioner is used.

Posted by Sai Sai <sa...@yahoo.in>.
I always get confused about why we should partition and what the use of it is.
Why would one want to send all the keys starting with A to Reducer1, B to Reducer2, and so on?
Is it just to parallelize the reduce process?
Please help.
Thanks
Sai

Re: diff between these 2 dirs

Posted by Sai Sai <sa...@yahoo.in>.
Just wondering if someone can explain the difference between these two directories:

Contents of directory /home/satish/work/mapred/staging/satish/.staging

and this dir:
/hadoop/mapred/system


Thanks
Sai

Re: Hadoop Development on cloud in a secure and economical way.

Posted by Ellis Miller <ou...@gmail.com>.
Configure a private cloud: install VMware / VirtualBox / KVM on an internal
server / cluster and leverage either Cloudera's Hadoop distribution (free
version) or Hortonworks (Yahoo spun off Hortonworks; where Cloudera is
exceptional but partly proprietary, Hortonworks requires some configuration
and tuning of Hadoop as part of their promise to keep their offering
entirely open source).

Both have Hadoop appliances which can be imported into VMware / VirtualBox.
Couple this with, perhaps, using Amazon Web Services for production hosting,
and all development can be done in-house with Eclipse and appropriate
source control management.

With Amazon you don't have to worry so much about your proprietary code
being stolen; the bigger concern (depending on your HA requirements) has
been several high-profile outages at their Virginia data centers (when you
subscribe to Amazon Web Services and create a Hadoop cluster, for example,
it asks you to specify a geographical region).

The nice thing about using virtualization in-house for development, test,
and integration environments is that your developers can code and perform
initial testing in Dev using VMware / VirtualBox, then take a snapshot,
export the Dev environment, and import it into production. VMware and
VirtualBox both provide proprietary means to accelerate this process,
streamlining an Agile / Extreme Programming software development framework
while simplifying the release cycle.

Personally, if I were going to do it on the cheap and keep even production
in-house, I would leverage x86 Linux servers (running Ubuntu, Fedora,
CentOS, or even openSUSE), which natively support KVM (Linux-based
virtualization, which is free), to host Hortonworks (Hadoop), leaving the
Unix administrators and network engineers to secure the production server,
which would have to be configured to accept public / external connections
(assuming you are hosting an application that is Software as a Service for
your customers) without compromising security.

I have worked at two firms in the past several years where we did the very
same thing, and it worked exceptionally well; it just takes time to get
your IT staff up to speed on virtualization, cluster administration, and
synchronizing virtual machine backups with the promotion of releases from
Dev to Test to Production.


On Tue, May 21, 2013 at 11:11 PM, Sai Sai <sa...@yahoo.in> wrote:

>
> Is it possible to do Hadoop development on cloud in a secure and
> economical way without worrying about our source being taken away. We
> would like to have Hadoop and eclipse installed on a vm in cloud and our
> developers will log into the cloud on a daily basis and work on the cloud.
> Like this we r hoping if we develop any product we will minimize our source
> being taken away by the devs or others and is secure. Please let me know,
> any  suggestions  u may have.
> Thanks,
> Sai
>



-- 
Ellis R. Miller
937.829.2380

<http://my.wisestamp.com/link?u=2hxhdfd4p76bkhcm&site=www.wisestamp.com/email-install>

Mundo Nulla Fides


<http://my.wisestamp.com/link?u=gfbmwhzrwxzcrjqx&site=www.wisestamp.com/email-install>

Re: Hadoop Development on cloud in a secure and economical way.

Posted by Rahul Bhattacharjee <ra...@gmail.com>.
Amazon Elastic Compute Cloud (EC2).
Pay per use.

Thanks,
Rahul


On Wed, May 22, 2013 at 11:41 AM, Sai Sai <sa...@yahoo.in> wrote:

>
> Is it possible to do Hadoop development on cloud in a secure and
> economical way without worrying about our source being taken away. We
> would like to have Hadoop and eclipse installed on a vm in cloud and our
> developers will log into the cloud on a daily basis and work on the cloud.
> Like this we r hoping if we develop any product we will minimize our source
> being taken away by the devs or others and is secure. Please let me know,
> any  suggestions  u may have.
> Thanks,
> Sai
>

Re: Hadoop Development on cloud in a secure and economical way.

Posted by Rahul Bhattacharjee <ra...@gmail.com>.
Amazon elastic cloud computer.
Pay per use

Thanks,
Rahul


On Wed, May 22, 2013 at 11:41 AM, Sai Sai <sa...@yahoo.in> wrote:

>
> Is it possible to do Hadoop development on cloud in a secure and
> economical way without worrying about our source being taken away. We
> would like to have Hadoop and eclipse installed on a vm in cloud and our
> developers will log into the cloud on a daily basis and work on the cloud.
> Like this we r hoping if we develop any product we will minimize our source
> being taken away by the devs or others and is secure. Please let me know,
> any  suggestions  u may have.
> Thanks,
> Sai
>

Re: Hadoop Development on cloud in a secure and economical way.

Posted by Rahul Bhattacharjee <ra...@gmail.com>.
Amazon elastic cloud computer.
Pay per use

Thanks,
Rahul


On Wed, May 22, 2013 at 11:41 AM, Sai Sai <sa...@yahoo.in> wrote:

>
> Is it possible to do Hadoop development on cloud in a secure and
> economical way without worrying about our source being taken away. We
> would like to have Hadoop and eclipse installed on a vm in cloud and our
> developers will log into the cloud on a daily basis and work on the cloud.
> Like this we r hoping if we develop any product we will minimize our source
> being taken away by the devs or others and is secure. Please let me know,
> any  suggestions  u may have.
> Thanks,
> Sai
>

Re: Hadoop Development on cloud in a secure and economical way.

Posted by Rahul Bhattacharjee <ra...@gmail.com>.
Amazon elastic cloud computer.
Pay per use

Thanks,
Rahul


On Wed, May 22, 2013 at 11:41 AM, Sai Sai <sa...@yahoo.in> wrote:

>
> Is it possible to do Hadoop development on cloud in a secure and
> economical way without worrying about our source being taken away. We
> would like to have Hadoop and eclipse installed on a vm in cloud and our
> developers will log into the cloud on a daily basis and work on the cloud.
> Like this we r hoping if we develop any product we will minimize our source
> being taken away by the devs or others and is secure. Please let me know,
> any  suggestions  u may have.
> Thanks,
> Sai
>

Re: Hadoop Development on cloud in a secure and economical way.

Posted by Ellis Miller <ou...@gmail.com>.
Configure private cloud: install VMWare / VirtualBox / KVM on internal
server / cluster and levearge either Cloudera Hadoop (free version) or
Hortonworks (Yahoo introduced Hortonworks and where Cloudera is exceptional
but proprietary  Hortonworks requires some configuration and tuning of
Hadoop in their promise to keep their Hadoop offering entirely open
source).

Both have Hadoop appliances which can be imported into VMWare / VirtualBox.
Couple this with, perhaps, using Amazon Web Services for production hosting
and all development can be done in-house with Eclipse and appropriate
Source Control Management.

With Amazon don't have to worry so much about your proprietary code being
stolen as the biggest concern (depending on your HA requirements) have been
several high-profile outages where their servers in VA (when you subscribe
to Amazon Web Services and create a Hadoop cluster, for example, it asks
you to specify the geographical region).

Nice thing about using Virtualization in-house for development, test,
integration environments your developers can code and perform initial
testing in Dev using VMWare / VirtualBox to then take a snapshot and export
the Dev environment then import into production. VWare and VirtualBox both
provide proprietary means to accelerate this process streamlining an Agile
/ Extreme Programming software development framework while simplifying the
release cycle.

Personally, if I were going to do it on the cheap and keep even production
in-house would leverage x86 Linux servers (running Ubuntu, Fedora, Centos,
or even OpenSuSE) which natively support KVM (linux based Virtualization
which is free) to host Hortonworks (Hadoop) leveraging the Unix
Administrators and Network Engineering monkey's to securing the Production
server which would have to be configured to accept public / external
connections (assuming you are hosting an application which is Software as a
Service for your customers) while not compromising security.

Have worked at 2 firms in the past several years where we did the very same
and it worked exceptionally well just takes time to get your IT staff up to
speed on Virtualization, Cluster administration and synchronizing Virtual
Machine backups with the promotion of releases from Dev to Test to
Production.


On Tue, May 21, 2013 at 11:11 PM, Sai Sai <sa...@yahoo.in> wrote:

>
> Is it possible to do Hadoop development on cloud in a secure and
> economical way without worrying about our source being taken away. We
> would like to have Hadoop and eclipse installed on a vm in cloud and our
> developers will log into the cloud on a daily basis and work on the cloud.
> Like this we r hoping if we develop any product we will minimize our source
> being taken away by the devs or others and is secure. Please let me know,
> any  suggestions  u may have.
> Thanks,
> Sai
>



-- 
Ellis R. Miller
937.829.2380

<http://my.wisestamp.com/link?u=2hxhdfd4p76bkhcm&site=www.wisestamp.com/email-install>

Mundo Nulla Fides


<http://my.wisestamp.com/link?u=gfbmwhzrwxzcrjqx&site=www.wisestamp.com/email-install>

Re: Hadoop Development on cloud in a secure and economical way.

Posted by Ellis Miller <ou...@gmail.com>.
Configure private cloud: install VMWare / VirtualBox / KVM on internal
server / cluster and levearge either Cloudera Hadoop (free version) or
Hortonworks (Yahoo introduced Hortonworks and where Cloudera is exceptional
but proprietary  Hortonworks requires some configuration and tuning of
Hadoop in their promise to keep their Hadoop offering entirely open
source).

Both have Hadoop appliances which can be imported into VMWare / VirtualBox.
Couple this with, perhaps, using Amazon Web Services for production hosting
and all development can be done in-house with Eclipse and appropriate
Source Control Management.

With Amazon don't have to worry so much about your proprietary code being
stolen as the biggest concern (depending on your HA requirements) have been
several high-profile outages where their servers in VA (when you subscribe
to Amazon Web Services and create a Hadoop cluster, for example, it asks
you to specify the geographical region).

Nice thing about using Virtualization in-house for development, test,
integration environments your developers can code and perform initial
testing in Dev using VMWare / VirtualBox to then take a snapshot and export
the Dev environment then import into production. VWare and VirtualBox both
provide proprietary means to accelerate this process streamlining an Agile
/ Extreme Programming software development framework while simplifying the
release cycle.

Personally, if I were going to do it on the cheap and keep even production
in-house would leverage x86 Linux servers (running Ubuntu, Fedora, Centos,
or even OpenSuSE) which natively support KVM (linux based Virtualization
which is free) to host Hortonworks (Hadoop) leveraging the Unix
Administrators and Network Engineering monkey's to securing the Production
server which would have to be configured to accept public / external
connections (assuming you are hosting an application which is Software as a
Service for your customers) while not compromising security.

Have worked at 2 firms in the past several years where we did the very same
and it worked exceptionally well just takes time to get your IT staff up to
speed on Virtualization, Cluster administration and synchronizing Virtual
Machine backups with the promotion of releases from Dev to Test to
Production.


On Tue, May 21, 2013 at 11:11 PM, Sai Sai <sa...@yahoo.in> wrote:

>
> Is it possible to do Hadoop development on cloud in a secure and
> economical way without worrying about our source being taken away. We
> would like to have Hadoop and eclipse installed on a vm in cloud and our
> developers will log into the cloud on a daily basis and work on the cloud.
> Like this we r hoping if we develop any product we will minimize our source
> being taken away by the devs or others and is secure. Please let me know,
> any  suggestions  u may have.
> Thanks,
> Sai
>



-- 
Ellis R. Miller
937.829.2380

<http://my.wisestamp.com/link?u=2hxhdfd4p76bkhcm&site=www.wisestamp.com/email-install>

Mundo Nulla Fides


<http://my.wisestamp.com/link?u=gfbmwhzrwxzcrjqx&site=www.wisestamp.com/email-install>

Re: Hadoop Development on cloud in a secure and economical way.

Posted by Ellis Miller <ou...@gmail.com>.
Configure private cloud: install VMWare / VirtualBox / KVM on internal
server / cluster and levearge either Cloudera Hadoop (free version) or
Hortonworks (Yahoo introduced Hortonworks and where Cloudera is exceptional
but proprietary  Hortonworks requires some configuration and tuning of
Hadoop in their promise to keep their Hadoop offering entirely open
source).

Both have Hadoop appliances which can be imported into VMWare / VirtualBox.
Couple this with, perhaps, using Amazon Web Services for production hosting
and all development can be done in-house with Eclipse and appropriate
Source Control Management.

With Amazon don't have to worry so much about your proprietary code being
stolen as the biggest concern (depending on your HA requirements) have been
several high-profile outages where their servers in VA (when you subscribe
to Amazon Web Services and create a Hadoop cluster, for example, it asks
you to specify the geographical region).

Nice thing about using Virtualization in-house for development, test,
integration environments your developers can code and perform initial
testing in Dev using VMWare / VirtualBox to then take a snapshot and export
the Dev environment then import into production. VWare and VirtualBox both
provide proprietary means to accelerate this process streamlining an Agile
/ Extreme Programming software development framework while simplifying the
release cycle.

Personally, if I were going to do it on the cheap and keep even production
in-house would leverage x86 Linux servers (running Ubuntu, Fedora, Centos,
or even OpenSuSE) which natively support KVM (linux based Virtualization
which is free) to host Hortonworks (Hadoop) leveraging the Unix
Administrators and Network Engineering monkey's to securing the Production
server which would have to be configured to accept public / external
connections (assuming you are hosting an application which is Software as a
Service for your customers) while not compromising security.

Have worked at 2 firms in the past several years where we did the very same
and it worked exceptionally well just takes time to get your IT staff up to
speed on Virtualization, Cluster administration and synchronizing Virtual
Machine backups with the promotion of releases from Dev to Test to
Production.


On Tue, May 21, 2013 at 11:11 PM, Sai Sai <sa...@yahoo.in> wrote:

>
> Is it possible to do Hadoop development on cloud in a secure and
> economical way without worrying about our source being taken away. We
> would like to have Hadoop and eclipse installed on a vm in cloud and our
> developers will log into the cloud on a daily basis and work on the cloud.
> Like this we r hoping if we develop any product we will minimize our source
> being taken away by the devs or others and is secure. Please let me know,
> any  suggestions  u may have.
> Thanks,
> Sai
>



-- 
Ellis R. Miller
937.829.2380


Mundo Nulla Fides



Re: diff between these 2 dirs

Posted by Sai Sai <sa...@yahoo.in>.
Just wondering if someone can explain what is the diff between these 2 dirs:

Contents of directory /home/satish/work/mapred/staging/satish/.staging

and this dir:
/hadoop/mapred/system


Thanks
Sai

Re: Hadoop Development on cloud in a secure and economical way.

Posted by Sai Sai <sa...@yahoo.in>.

Is it possible to do Hadoop development on cloud in a secure and economical way without worrying about our source being taken away. We would like to have Hadoop and eclipse installed on a vm in cloud and our developers will log into the cloud on a daily basis and work on the cloud. Like this we r hoping if we develop any product we will minimize our source being taken away by the devs or others and is secure. Please let me know, any  suggestions  u may have.
Thanks,
Sai

Re: Why/When partitioner is used.

Posted by Sai Sai <sa...@yahoo.in>.
I always get confused why we should partition and what is the use of it.
Why would one want to send all the keys starting with A to Reducer1 and B to R2 and so on...
Is it just to parallelize the reduce process.
Please help.
Thanks
Sai
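Yes, it is mainly to parallelize the reduce phase: the partitioner decides which reducer each map output key is sent to, and because every record with the same key is routed to the same reducer, all of a key's values end up in one place. A small single-process Python sketch of the idea; the two-reducer "A..M vs the rest" scheme is just a hypothetical illustration of the question above, and the hash version is only a rough analogue of Hadoop's default HashPartitioner:

```python
def default_partition(key, num_reducers):
    # Rough analogue of Hadoop's HashPartitioner: the same key always
    # lands on the same reducer, and keys spread roughly evenly.
    return hash(key) % num_reducers

def alphabet_partition(key, num_reducers):
    # A custom scheme like the one described above: keys starting with
    # 'A'..'M' go to reducer 0, everything else to reducer 1.
    return 0 if key[:1].upper() <= "M" else 1

for k in ["Apple", "Banana", "apple", "Zebra"]:
    print(k, "-> reducer", alphabet_partition(k, 2))
```

With only the default partitioner you still get correctness; a custom one is mostly about controlling how reduce work (and output files) are distributed.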

Re: Install hadoop on multiple VMs in 1 laptop like a cluster

Posted by Sai Sai <sa...@yahoo.in>.
Just wondering if anyone has any documentation or references to any articles how to simulate a multi node cluster setup in 1 laptop with hadoop running on multiple ubuntu VMs. any help is appreciated.
Thanks
Sai

Re: Project ideas

Posted by Juan Suero <ju...@gmail.com>.
im a newbie but maybe this will also add some value...
it is my understanding that mapreduce is like a distributed "group by"
statement
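To make the analogy concrete, here is a tiny single-process Python sketch of the map / shuffle / reduce pipeline doing a word count, which is essentially SELECT word, COUNT(*) ... GROUP BY word; Hadoop runs the same three steps spread across many machines:

```python
from collections import defaultdict

def map_phase(line):
    # map: emit a (word, 1) pair for every word in the input line
    for word in line.split():
        yield word.lower(), 1

def shuffle(pairs):
    # shuffle: group all values by key, as the framework does
    # between the map and reduce phases
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # reduce: aggregate each group -- the "group by" part
    return key, sum(values)

lines = ["the quick fox", "the lazy dog"]
pairs = [p for line in lines for p in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'the': 2, 'quick': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```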

when you run a statement like this against your petabytes of data it can
take a long time.. first and foremost because the first thing you have to
do before you apply the group-by logic is read the data off disk.

if your disk reads at 100 MB/s then you can do the math.
The query will take at least that long to complete.
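The arithmetic, with hypothetical numbers (1 PB of data, 100 MB/s per disk, 300 disks in parallel):

```python
PETABYTE = 10**15            # bytes
THROUGHPUT = 100 * 10**6     # 100 MB/s per disk

# one disk scanning the whole dataset sequentially
single_disk_days = PETABYTE / THROUGHPUT / 86_400

# the same scan split evenly across 300 disks
cluster_hours = PETABYTE / (300 * THROUGHPUT) / 3_600

print(f"one disk:  {single_disk_days:.0f} days")   # 116 days
print(f"300 disks: {cluster_hours:.1f} hours")     # 9.3 hours
```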

If you need this info really fast, like in the next hour to support, i dunno,
personalization features on an e-commerce site, or a month-end report that
needs to be complete in 2 hours,
then it would be nice to put equal parts of your data on 100s of disks and
run the same algorithm in parallel.

but that's just if your bottleneck is disk.
what if your dataset is relatively small but the calculations done on each
incoming element are large?
then your bottleneck is CPU power.

there are a lot of bottlenecks you could run into:
number of threads
memory
latency of remote APIs or a remote database you hit as you analyze the data

There's a book called Programming Collective Intelligence from O'Reilly that
should help you out too:
http://shop.oreilly.com/product/9780596529321.do



On Tue, May 21, 2013 at 11:02 PM, Sai Sai <sa...@yahoo.in> wrote:

> Excellent Sanjay, really excellent input. Many Thanks for this input.
> I have been always thinking about some ideas but never knowing what to
> proceed with.
> Thanks again.
> Sai
>
>   ------------------------------
>  *From:* Sanjay Subramanian <Sa...@wizecommerce.com>
> *To:* "user@hadoop.apache.org" <us...@hadoop.apache.org>
> *Sent:* Tuesday, 21 May 2013 11:51 PM
> *Subject:* Re: Project ideas
>
>  +1
>
>  My $0.02 is look look around and see problems u can solve…Its better to
> get a list of problems and see if u can model a solution using map-reduce
> framework
>
>  An example is as follows
>
>  PROBLEM
> Build a Cars Pricing Model based on advertisements on Craigs list
>
>  OBJECTIVE
> Recommend a price to the Craigslist car seller when the user gives info
> about make,model,color,miles
>
>  DATA required
> Collect RSS feeds daily from Craigs List (don't pound their website , else
> they will lock u down)
>
>  DESIGN COMPONENTS
> - Daily RSS Collector - pulls data and puts into HDFS
> - Data Loader - Structures the columns u need to analyze and puts into HDFS
> - Hive Aggregator and analyzer - studies and queries data and brings out
> recommendation models for car pricing
> - REST Web service to return query results in XML/JSON
> - iPhone App that talks to web service and gets info
>
>  There u go…this should keep a couple of students busy for 3 months
>
>  I find this kind of problem statement and solutions simpler to
> understand because its all there in the real world !
>
>  An example of my way of thinking led to me founding this non profit
> called www.medicalsidefx.org that gives users valuable metrics regarding
> medical side fx.
> It uses Hadoop to aggregate , Lucene to search….This year I am redesigning
> the core to use Hive :-)
>
>  Good luck
>
>  Sanjay
>
>
>
>
>
>   From: Michael Segel <mi...@hotmail.com>
> Reply-To: "user@hadoop.apache.org" <us...@hadoop.apache.org>
> Date: Tuesday, May 21, 2013 6:46 AM
> To: "user@hadoop.apache.org" <us...@hadoop.apache.org>
> Subject: Re: Project ideas
>
>  Drink heavily?
>
>  Sorry.
>
>  Let me rephrase.
>
>  Part of the exercise is for you, the student to come up with the idea.
> Not solicit someone else for a suggestion.  This is how you learn.
>
>  The exercise is to get you to think about the following:
>
>  1) What is Hadoop
> 2) How does it work
> 3) Why would you want to use it
>
>  You need to understand #1 and #2 to be able to #3.
>
>  But at the same time... you need to also incorporate your own view of
> the world.
> What are your hobbies? What do you like to do?
> What scares you the most?  What excites you the most?
> Why are you here?
> And most importantly, what do you think you can do within the time period.
> (What data can you easily capture and work with...)
>
>  Have you ever seen 'Eden of the East' ? ;-)
>
>  HTH
>
>
>  On May 21, 2013, at 8:35 AM, Anshuman Mathur <an...@gmail.com> wrote:
>
>  Hello fellow users,
> We are a group of students studying in National University of Singapore.
> As part of our course curriculum we need to develop an application using
> Hadoop and  map-reduce. Can you please suggest some innovative ideas for
> our project?
> Thanks in advance.
> Anshuman
>
>
>
> CONFIDENTIALITY NOTICE
> ======================
> This email message and any attachments are for the exclusive use of the
> intended recipient(s) and may contain confidential and privileged
> information. Any unauthorized review, use, disclosure or distribution is
> prohibited. If you are not the intended recipient, please contact the
> sender by reply email and destroy all copies of the original message along
> with any attachments, from your computer system. If you are the intended
> recipient, please be advised that the content of this message is subject to
> access, review and disclosure by the sender's Email System Administrator.
>
>
>

Re: Project ideas

Posted by Sai Sai <sa...@yahoo.in>.
Excellent Sanjay, really excellent input. Many Thanks for this input.
I have been always thinking about some ideas but never knowing what to proceed with.
Thanks again.
Sai


________________________________
 From: Sanjay Subramanian <Sa...@wizecommerce.com>
To: "user@hadoop.apache.org" <us...@hadoop.apache.org> 
Sent: Tuesday, 21 May 2013 11:51 PM
Subject: Re: Project ideas
 


+1 

My $0.02 is look look around and see problems u can solve…Its better to get a list of problems and see if u can model a solution using map-reduce framework 

An example is as follows

PROBLEM 
Build a Cars Pricing Model based on advertisements on Craigs list

OBJECTIVE
Recommend a price to the Craigslist car seller when the user gives info about make,model,color,miles

DATA required
Collect RSS feeds daily from Craigs List (don't pound their website , else they will lock u down) 

DESIGN COMPONENTS
- Daily RSS Collector - pulls data and puts into HDFS
- Data Loader - Structures the columns u need to analyze and puts into HDFS
- Hive Aggregator and analyzer - studies and queries data and brings out recommendation models for car pricing
- REST Web service to return query results in XML/JSON
- iPhone App that talks to web service and gets info

There u go…this should keep a couple of students busy for 3 months

I find this kind of problem statement and solutions simpler to understand because its all there in the real world !

An example of my way of thinking led to me founding this non profit called www.medicalsidefx.org that gives users valuable metrics regarding medical side fx.
It uses Hadoop to aggregate , Lucene to search….This year I am redesigning the core to use Hive :-) 

Good luck 

Sanjay

 


From: Michael Segel <mi...@hotmail.com>
Reply-To: "user@hadoop.apache.org" <us...@hadoop.apache.org>
Date: Tuesday, May 21, 2013 6:46 AM
To: "user@hadoop.apache.org" <us...@hadoop.apache.org>
Subject: Re: Project ideas


Drink heavily?  

Sorry.

Let me rephrase.

Part of the exercise is for you, the student to come up with the idea. Not solicit someone else for a suggestion.  This is how you learn. 

The exercise is to get you to think about the following:

1) What is Hadoop
2) How does it work
3) Why would you want to use it

You need to understand #1 and #2 to be able to #3.

But at the same time... you need to also incorporate your own view of the world. 
What are your hobbies? What do you like to do? 
What scares you the most?  What excites you the most? 
Why are you here? 
And most importantly, what do you think you can do within the time period. 
(What data can you easily capture and work with...) 

Have you ever seen 'Eden of the East' ? ;-) 

HTH



On May 21, 2013, at 8:35 AM, Anshuman Mathur <an...@gmail.com> wrote:

Hello fellow users,
>We are a group of students studying in National University of Singapore. As part of our course curriculum we need to develop an application using Hadoop and  map-reduce. Can you please suggest some innovative ideas for our project?
>Thanks in advance.
>Anshuman



Re: Project ideas

Posted by Sanjay Subramanian <Sa...@wizecommerce.com>.
+1

My $0.02 is: look around and see problems u can solve…It's better to get a list of problems and see if u can model a solution using the map-reduce framework

An example is as follows

PROBLEM
Build a Cars Pricing Model based on advertisements on Craigs list

OBJECTIVE
Recommend a price to the Craigslist car seller when the user gives info about make,model,color,miles

DATA required
Collect RSS feeds daily from Craigs List (don't pound their website , else they will lock u down)

DESIGN COMPONENTS
- Daily RSS Collector - pulls data and puts into HDFS
- Data Loader - Structures the columns u need to analyze and puts into HDFS
- Hive Aggregator and analyzer - studies and queries data and brings out recommendation models for car pricing
- REST Web service to return query results in XML/JSON
- iPhone App that talks to web service and gets info
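A minimal sketch of the first component, the daily RSS collector, assuming Python with only the standard library; the feed snippet, field names, and HDFS path are made up for illustration (a real collector would fetch the Craigslist feed over HTTP and write the records into HDFS via the HDFS client or `hadoop fs -put`):

```python
import xml.etree.ElementTree as ET

# Hypothetical feed snippet standing in for a fetched Craigslist RSS page
SAMPLE_FEED = """<rss version="2.0"><channel>
  <item><title>2009 Honda Civic - $8500</title>
        <link>http://example.org/ad/1</link></item>
  <item><title>2005 Ford F-150 - $6000</title>
        <link>http://example.org/ad/2</link></item>
</channel></rss>"""

def parse_ads(feed_xml):
    # Pull (title, link) out of each <item>; a real run would extract
    # make/model/price here before handing off to the Data Loader step.
    root = ET.fromstring(feed_xml)
    return [(item.findtext("title"), item.findtext("link"))
            for item in root.iter("item")]

ads = parse_ads(SAMPLE_FEED)
for title, link in ads:
    # lines destined for HDFS, e.g. a path like /data/craigslist/raw/<date>
    print(title, "\t", link)
```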

There u go…this should keep a couple of students busy for 3 months

I find this kind of problem statement and solutions simpler to understand because its all there in the real world !

An example of my way of thinking led to me founding this non profit called www.medicalsidefx.org that gives users valuable metrics regarding medical side fx.
It uses Hadoop to aggregate , Lucene to search….This year I am redesigning the core to use Hive :-)

Good luck

Sanjay





From: Michael Segel <mi...@hotmail.com>
Reply-To: "user@hadoop.apache.org" <us...@hadoop.apache.org>
Date: Tuesday, May 21, 2013 6:46 AM
To: "user@hadoop.apache.org" <us...@hadoop.apache.org>
Subject: Re: Project ideas

Drink heavily?

Sorry.

Let me rephrase.

Part of the exercise is for you, the student to come up with the idea. Not solicit someone else for a suggestion.  This is how you learn.

The exercise is to get you to think about the following:

1) What is Hadoop
2) How does it work
3) Why would you want to use it

You need to understand #1 and #2 to be able to #3.

But at the same time... you need to also incorporate your own view of the world.
What are your hobbies? What do you like to do?
What scares you the most?  What excites you the most?
Why are you here?
And most importantly, what do you think you can do within the time period.
(What data can you easily capture and work with...)

Have you ever seen 'Eden of the East' ? ;-)

HTH


On May 21, 2013, at 8:35 AM, Anshuman Mathur <an...@gmail.com>> wrote:


Hello fellow users,

We are a group of students studying in National University of Singapore. As part of our course curriculum we need to develop an application using Hadoop and  map-reduce. Can you please suggest some innovative ideas for our project?

Thanks in advance.

Anshuman


CONFIDENTIALITY NOTICE
======================
This email message and any attachments are for the exclusive use of the intended recipient(s) and may contain confidential and privileged information. Any unauthorized review, use, disclosure or distribution is prohibited. If you are not the intended recipient, please contact the sender by reply email and destroy all copies of the original message along with any attachments, from your computer system. If you are the intended recipient, please be advised that the content of this message is subject to access, review and disclosure by the sender's Email System Administrator.

Re: Project ideas

Posted by Michael Segel <mi...@hotmail.com>.
Drink heavily? 

Sorry.

Let me rephrase.

Part of the exercise is for you, the student to come up with the idea. Not solicit someone else for a suggestion.  This is how you learn. 

The exercise is to get you to think about the following:

1) What is Hadoop
2) How does it work
3) Why would you want to use it

You need to understand #1 and #2 to be able to #3.

But at the same time... you need to also incorporate your own view of the world. 
What are your hobbies? What do you like to do? 
What scares you the most?  What excites you the most? 
Why are you here? 
And most importantly, what do you think you can do within the time period. 
(What data can you easily capture and work with...) 

Have you ever seen 'Eden of the East' ? ;-) 

HTH


On May 21, 2013, at 8:35 AM, Anshuman Mathur <an...@gmail.com> wrote:

> Hello fellow users,
> 
> We are a group of students studying in National University of Singapore. As part of our course curriculum we need to develop an application using Hadoop and  map-reduce. Can you please suggest some innovative ideas for our project?
> 
> Thanks in advance.
> 
> Anshuman
> 


Re: Project ideas

Posted by Kun Ling <lk...@gmail.com>.
Hi Anshuman,
   Since MR is essentially: split the input, map it to different nodes, run it in
parallel, and combine the results, I would suggest you look into the
application of divide-and-conquer algorithms, and port one, or rewrite
it, in Hadoop MapReduce.
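As a toy illustration of that correspondence (plain Python, not actual Hadoop code; the function names are made up for this sketch), a divide-and-conquer maximum maps directly onto the split/map/combine pattern:

```python
# Toy illustration of how a divide-and-conquer problem maps onto
# MapReduce: each "mapper" conquers its own split independently,
# and the "reducer" combines the partial results.

def split_input(data, n_splits):
    """Partition the input, as Hadoop does with input splits."""
    size = max(1, len(data) // n_splits)
    return [data[i:i + size] for i in range(0, len(data), size)]

def map_phase(splits):
    """Each mapper solves its split locally (here: a local maximum)."""
    return [max(chunk) for chunk in splits if chunk]

def reduce_phase(partials):
    """The reducer merges the partial answers into the final one."""
    return max(partials)

if __name__ == "__main__":
    data = [3, 41, 7, 26, 58, 9, 12, 2]
    print(reduce_phase(map_phase(split_input(data, 4))))  # 58
```

The same shape fits other divide-and-conquer problems (sorting, counting, matrix aggregation): the local "conquer" step becomes the map function, and the "combine" step becomes the reduce function.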

yours,
Ling Kun


On Tue, May 21, 2013 at 9:35 PM, Anshuman Mathur <an...@gmail.com> wrote:

> Hello fellow users,
>
> We are a group of students studying in National University of Singapore.
> As part of our course curriculum we need to develop an application using
> Hadoop and  map-reduce. Can you please suggest some innovative ideas for
> our project?
>
> Thanks in advance.
>
> Anshuman
>



-- 
http://www.lingcc.com
