Posted to common-user@hadoop.apache.org by Stas Oskin <st...@gmail.com> on 2008/09/23 21:50:44 UTC

Hadoop for real time

Hi.

Is it possible to use Hadoop for a real-time app in the video processing field?

Regards.

Re: Hadoop for real time

Posted by Stas Oskin <st...@gmail.com>.
Hi Ted.

Thanks for sharing some of the inner workings of Veoh, which, btw, I'm a
frequent user of (or at least when time permits :) ).

I indeed recall reading somewhere that Veoh used a heavily modified version
of MogileFS, but has since switched away, as it wasn't mature enough for
Veoh's needs.

If not Hadoop, are there any other available solutions that can assist in
distributing the processing of real-time video data? Or is the old way of
separate application servers the only way to go?

Regards.



2008/10/20 Ted Dunning <te...@gmail.com>

> Hadoop may not be quite what you want for this.
>
> You could definitely use Hadoop for storage and streaming.  You can also do
> various kinds of processing on Hadoop.
>
> But because Hadoop is primarily intended for batch-style operations, there
> is a bit of an assumption that some administrative tasks will take down the
> cluster.  That may be a problem (video serving tends to have a web audience
> that isn't very tolerant of downtime).
>
> At Veoh, we used a simpler, but more limited, system for serving videos
> that was originally based on Mogile.  The basic idea is that there is a
> database that contains name-to-URL mappings.  The URLs point to storage
> boxes that have a bunch of disks that are served out to the net via
> lighttpd.  A management machine runs occasionally to make sure that files
> are replicated according to policy.  The database is made redundant via
> conventional mechanisms.  Requests for files can be proxied by a farm of
> front-end machines that query the database for locations, or you can use
> redirects directly to the content.  How you do it depends on network
> topology and your sensitivity about divulging internal details.  Redirects
> can give higher peak read speed since you are going direct.  Proxying
> avoids a network round trip for the redirect.
>
> At Veoh, this system fed the content delivery networks as a caching layer,
> which meant that the traffic was essentially uniform random access.  This
> system handled a huge number of files (10^9 or so) very easily and has
> essentially never had customer-visible downtime.  Extension with new file
> systems is trivial (just tell the manager box and it starts using them).
>
> This arrangement lacks most of the things that make Hadoop really good for
> what it does.  But, in return, it is incredibly simple.  It isn't very
> suitable for map-reduce or other high-bandwidth processing tasks.  It
> doesn't allow computation to go to the data.  It doesn't allow large files
> to be read in parallel from many machines.  On the other hand, it handles
> way more files than Hadoop does, and it handles gobs of tiny files pretty
> well.
>
> Video is also kind of a write-once medium in many cases, and video files
> aren't really splittable for map-reduce purposes.  That might mean that
> you could get away with a Mogile-ish system.
>
> On Tue, Oct 14, 2008 at 1:29 PM, Stas Oskin <st...@gmail.com> wrote:
>
> > Hi.
> >
> > Video storage, processing and streaming.
> >
> > Regards.
> >
> > 2008/9/25 Edward J. Yoon <ed...@apache.org>
> >
> > > What kind of the real-time app?
> > >
> > > On Wed, Sep 24, 2008 at 4:50 AM, Stas Oskin <st...@gmail.com>
> > wrote:
> > > > Hi.
> > > >
> > > > Is it possible to use Hadoop for real-time app, in video processing
> > > field?
> > > >
> > > > Regards.
> > > >
> > >
> > > --
> > > Best regards, Edward J. Yoon
> > > edwardyoon@apache.org
> > > http://blog.udanax.org
> > >
> >
>
>
>
> --
> ted
>

Re: Hadoop for real time

Posted by Ted Dunning <te...@gmail.com>.
Hadoop may not be quite what you want for this.

You could definitely use Hadoop for storage and streaming.  You can also do
various kinds of processing on Hadoop.

But because Hadoop is primarily intended for batch-style operations, there
is a bit of an assumption that some administrative tasks will take down the
cluster.  That may be a problem (video serving tends to have a web audience
that isn't very tolerant of downtime).

At Veoh, we used a simpler, but more limited, system for serving videos that
was originally based on Mogile.  The basic idea is that there is a database
that contains name-to-URL mappings.  The URLs point to storage boxes that
have a bunch of disks that are served out to the net via lighttpd.  A
management machine runs occasionally to make sure that files are replicated
according to policy.  The database is made redundant via conventional
mechanisms.  Requests for files can be proxied by a farm of front-end
machines that query the database for locations, or you can use redirects
directly to the content.  How you do it depends on network topology and your
sensitivity about divulging internal details.  Redirects can give higher
peak read speed since you are going direct.  Proxying avoids a network round
trip for the redirect.

At Veoh, this system fed the content delivery networks as a caching layer,
which meant that the traffic was essentially uniform random access.  This
system handled a huge number of files (10^9 or so) very easily and has
essentially never had customer-visible downtime.  Extension with new file
systems is trivial (just tell the manager box and it starts using them).

This arrangement lacks most of the things that make Hadoop really good for
what it does.  But, in return, it is incredibly simple.  It isn't very
suitable for map-reduce or other high-bandwidth processing tasks.  It
doesn't allow computation to go to the data.  It doesn't allow large files
to be read in parallel from many machines.  On the other hand, it handles
way more files than Hadoop does, and it handles gobs of tiny files pretty
well.

Video is also kind of a write-once medium in many cases, and video files
aren't really splittable for map-reduce purposes.  That might mean that you
could get away with a Mogile-ish system.
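The name-to-URL lookup layer described above can be sketched roughly as
follows. All hostnames, paths, and the in-memory dict here are made up for
illustration; a real deployment would back this with a replicated SQL
database, and the front end would answer with an HTTP 302 (redirect mode)
or fetch and stream the bytes itself (proxy mode):

```python
import random

# Stand-in for the redundant database of name -> URL mappings
# (hypothetical hostnames; a real system would use a replicated DB).
replica_db = {
    "video123.flv": [
        "http://storage-01.example.com/vol1/video123.flv",
        "http://storage-07.example.com/vol4/video123.flv",
    ],
}

def locate(name):
    """Return one replica URL for a file name, or None if unknown.

    Picking a replica at random spreads read load across storage boxes.
    """
    urls = replica_db.get(name)
    return random.choice(urls) if urls else None

# Redirect mode: a front end answers "302 Location: locate(name)".
# Proxy mode: the front end fetches locate(name) and streams it back,
# hiding internal storage hostnames from clients.
```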

On Tue, Oct 14, 2008 at 1:29 PM, Stas Oskin <st...@gmail.com> wrote:

> Hi.
>
> Video storage, processing and streaming.
>
> Regards.
>
> 2008/9/25 Edward J. Yoon <ed...@apache.org>
>
> > What kind of the real-time app?
> >
> > On Wed, Sep 24, 2008 at 4:50 AM, Stas Oskin <st...@gmail.com>
> wrote:
> > > Hi.
> > >
> > > Is it possible to use Hadoop for real-time app, in video processing
> > field?
> > >
> > > Regards.
> > >
> >
> > --
> > Best regards, Edward J. Yoon
> > edwardyoon@apache.org
> > http://blog.udanax.org
> >
>



-- 
ted

Re: Are There Books of Hadoop/Pig?

Posted by "Amit k. Saha" <am...@gmail.com>.
On Wed, Oct 15, 2008 at 4:10 AM, Steve Gao <st...@yahoo.com> wrote:
> Does anybody know if there are books about Hadoop or Pig? The wiki and manual are kind of ad hoc and hard to comprehend; for example, I want to know how to apply patches to my Hadoop, but can't find how to do it.
>
> Would anybody help? Thanks.

http://oreilly.com/catalog/9780596521998/

HTH,
Amit
>
>
>
>



-- 
Amit Kumar Saha
http://blogs.sun.com/amitsaha/
http://amitsaha.in.googlepages.com/
Skype: amitkumarsaha

Re: Need reboot the whole system if adding new datanodes?

Posted by Steve Loughran <st...@apache.org>.
Amit k. Saha wrote:
> On Wed, Oct 15, 2008 at 9:09 AM, David Wei <we...@kingsoft.com> wrote:
>> It seems that we need to restart the whole hadoop system in order to add new
>> nodes inside the cluster. Any solution for us that no need for the
>> rebooting?
> 
> From what I know so far, you have to start the HDFS daemon (which
> reads the 'slaves' file) to 'let it know' which are the data nodes. So
> every time you add a new DataNode, I believe you will have to restart
> the daemon, which is like re-initiating the NameNode.
> 

You don't need a slaves file; you can connect to a namenode without it.
So: no need to restart daemons. When taking datanodes away deliberately,
what you should do is decommission them, to shut them down cleanly and
make sure all their data is copied off first. If you just kill a
datanode, the namenode will notice, but some data may be
under-replicated for a while.
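A decommission along those lines might look like this (a sketch; the
hostname and exclude-file path are examples, and `dfs.hosts.exclude` must
already be set in the namenode's configuration):

```shell
# 1. Add the retiring node to the exclude file that the namenode's
#    dfs.hosts.exclude property points at:
echo "datanode-03.example.com" >> conf/excludes

# 2. Tell the namenode to re-read its host lists; it re-replicates the
#    node's blocks elsewhere and then marks it as decommissioned:
bin/hadoop dfsadmin -refreshNodes
```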

-steve

Re: Need reboot the whole system if adding new datanodes?

Posted by Paul <pa...@gmail.com>.
As long as the new node is in the slaves file on the master, just do a  
start-all.sh and it will attempt to start everything.  Nodes that are  
already running will keep running and new nodes will be started.

Consider doing a rebalance after adding a new node for better  
distribution.
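The two steps above might look like this on the master (a sketch; the
threshold value is an example, and the balancer script ships with recent
Hadoop releases):

```shell
# Already-running daemons are left alone; the node newly listed in
# conf/slaves gets started.
bin/start-all.sh

# Then rebalance: move blocks onto the new (empty) node until every
# datanode is within the threshold (in percent) of average utilization.
bin/start-balancer.sh -threshold 10
```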



-paul

On Oct 15, 2008, at 1:55 AM, "Amit k. Saha" <am...@gmail.com>  
wrote:

> On Wed, Oct 15, 2008 at 9:09 AM, David Wei <we...@kingsoft.com>  
> wrote:
>> It seems that we need to restart the whole hadoop system in order  
>> to add new
>> nodes inside the cluster. Any solution for us that no need for the
>> rebooting?
>
> From what I know so far, you have to start the HDFS daemon (which
> reads the 'slaves' file) to 'let it know' which are the data nodes. So
> every time you add a new DataNode, I believe you will have to restart
> the daemon, which is like re-initiating the NameNode.
>
> Hope I am not very wrong :-)
>
> Best,
> Amit
>
> -- 
> Amit Kumar Saha
> http://blogs.sun.com/amitsaha/
> http://amitsaha.in.googlepages.com/
> Skype: amitkumarsaha

Re: Need reboot the whole system if adding new datanodes?

Posted by Prasad Pingali <pv...@research.iiit.ac.in>.
you can use the hadoop-daemon.sh script provided in the bin folder. These
are the steps.

On the new machine to be added:
1.) ensure the Hadoop config is pointing to the right namenode.
2.) run bin/hadoop-daemon.sh start datanode

This should add the datanode without needing a restart of the complete
cluster.
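Concretely, on the new machine, those steps might look like this (the
namenode hostname and port are examples):

```shell
# 1.) Check that fs.default.name in conf/hadoop-site.xml points at the
#     existing namenode, e.g. hdfs://namenode.example.com:9000
grep -A 2 fs.default.name conf/hadoop-site.xml

# 2.) Start only the datanode daemon; it registers with the namenode at
#     startup, so the rest of the cluster keeps running untouched.
bin/hadoop-daemon.sh start datanode
```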

- Prasad. 

On Wednesday 15 October 2008 11:25:29 am Amit k. Saha wrote:
> On Wed, Oct 15, 2008 at 9:09 AM, David Wei <we...@kingsoft.com> wrote:
> > It seems that we need to restart the whole hadoop system in order to add
> > new nodes inside the cluster. Any solution for us that no need for the
> > rebooting?
> >
> From what I know so far, you have to start the HDFS daemon (which
> reads the 'slaves' file) to 'let it know' which are the data nodes. So
> every time you add a new DataNode, I believe you will have to restart
> the daemon, which is like re-initiating the NameNode.
>
> Hope I am not very wrong :-)
>
> Best,
> Amit





Re: Need reboot the whole system if adding new datanodes?

Posted by "Amit k. Saha" <am...@gmail.com>.
On Wed, Oct 15, 2008 at 9:09 AM, David Wei <we...@kingsoft.com> wrote:
> It seems that we need to restart the whole hadoop system in order to add new
> nodes inside the cluster. Any solution for us that no need for the
> rebooting?

From what I know so far, you have to start the HDFS daemon (which
reads the 'slaves' file) to 'let it know' which are the data nodes. So
every time you add a new DataNode, I believe you will have to restart
the daemon, which is like re-initiating the NameNode.

Hope I am not very wrong :-)

Best,
Amit

-- 
Amit Kumar Saha
http://blogs.sun.com/amitsaha/
http://amitsaha.in.googlepages.com/
Skype: amitkumarsaha

Need reboot the whole system if adding new datanodes?

Posted by David Wei <we...@kingsoft.com>.
It seems that we need to restart the whole hadoop system in order to add 
new nodes inside the cluster. Any solution for us that no need for the 
rebooting?

PS: We just had one namenode in the cluster

Thx!

David



Are There Books of Hadoop/Pig?

Posted by Steve Gao <st...@yahoo.com>.
Does anybody know if there are books about Hadoop or Pig? The wiki and manual are kind of ad hoc and hard to comprehend; for example, I want to know how to apply patches to my Hadoop, but can't find how to do it.

Would anybody help? Thanks.


Re: Hadoop for real time

Posted by Stas Oskin <st...@gmail.com>.
Hi.

Video storage, processing and streaming.

Regards.

2008/9/25 Edward J. Yoon <ed...@apache.org>

> What kind of the real-time app?
>
> On Wed, Sep 24, 2008 at 4:50 AM, Stas Oskin <st...@gmail.com> wrote:
> > Hi.
> >
> > Is it possible to use Hadoop for real-time app, in video processing
> field?
> >
> > Regards.
> >
>
> --
> Best regards, Edward J. Yoon
> edwardyoon@apache.org
> http://blog.udanax.org
>

Re: Hadoop for real time

Posted by "Edward J. Yoon" <ed...@apache.org>.
What kind of real-time app?

On Wed, Sep 24, 2008 at 4:50 AM, Stas Oskin <st...@gmail.com> wrote:
> Hi.
>
> Is it possible to use Hadoop for real-time app, in video processing field?
>
> Regards.
>

-- 
Best regards, Edward J. Yoon
edwardyoon@apache.org
http://blog.udanax.org