Posted to common-user@hadoop.apache.org by nijil <no...@yahoo.co.in> on 2010/02/02 17:43:22 UTC

help for hadoop begginer

I have read the basics about Hadoop, but I still have a few doubts. Mind you,
I am a beginner.

1: So is Hadoop only a file system?

2: Can HBase be used in place of other databases on other platforms (e.g. Java)?

3: What exactly is MapReduce, and how is it related to Hadoop? (I mean, is it only
about parallel computing? I don't understand how much parallel computing
is possible in a Hadoop cloud system that is used for web hosting.) I need
some help on the topic of clubbing "Cloud Computing, Hadoop and
Webhosting". Please, this is really important.

4: Are HBase and Hypertable similar, or is there a big difference?

5: Can someone provide a MapReduce implementation example that is not
related to search engines?

6: How are MapReduce and Hadoop related?

7: Can I learn Hadoop with the "Cloudera's Distribution for Hadoop" VMware
image?

8: How is database synchronization done in HBase? I believe HBase is a
distributed database.

Can someone provide contact details for further help, if you don't
mind? :)
-- 
View this message in context: http://old.nabble.com/help-for-hadoop-begginer-tp27423435p27423435.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.


Re: help for hadoop begginer

Posted by Steve Loughran <st...@apache.org>.
nijil wrote:
> i have read about basic stuff about hadoop..err i have a few doubts...mind u
> am a begginer
> 
> 1:so is hadoop a file sytem only?
> 
> 2:can hbase be used instead of other databases in other platforms(eg java)?
> 
> 3:what is mapreduce exactly and hw is it related to hadoop(i mean is it only
> about parallel computing.....i dont understand how much paralell computing
> is possible in a hadoop cloud sytem which is use for webhosting) .I require
> some help on the topic on clubbing "Cloud Computing,Hadoop and
> Webhosting"......please this is really important
> 
> 
> 
> 4:Is hbase and hypertabe similar or is there a big difference
> 
> 5:Can some one provide a map reduce implementation example other than
> related to search engine.
> 
> 6:How is mapreduce and hadoop related?
> 
> 7:Can i learn hadoop with a "cloudera's Distribution for Hadoop" vmware
> image..........
> 
> 8:how is database synchronization done in hbase.....i belive hbase is a
> distributed database
> 
> :can some one provide contact details for further help if u dont
> mind........... :)

When does your homework have to be done by? Is there someone we could
email it to directly, so that whoever answers the questions gets the
credit they deserve?

-steve

Re: sort at reduce side

Posted by Edward Capriolo <ed...@gmail.com>.
2010/2/3 Srigurunath Chakravarthi <sr...@yahoo-inc.com>:
> Hi Gang,
>
>>kept in map file. If so, in order to efficiently sort the data, reducer
>>actually only read the index part of each spill (which is a map file) and
>>sort the keys, instead of reading whole records from disk and sort them.
>
>  afaik, no. Reduces always fetches map output data and not indexes (even if the data is from the local node, where an index may be sufficient).
>
> Regards,
> Sriguru
>
>>-----Original Message-----
>>From: Gang Luo [mailto:lgpublic@yahoo.com.cn]
>>Sent: Wednesday, February 03, 2010 10:40 AM
>>To: common-user@hadoop.apache.org
>>Subject: sort at reduce side
>>
>>Hi all,
>>I want to know some more details about the sorting at the reduce side.
>>
>>The intermediate result generated at the map side is stored as map file
>>which actually consists of two sub-files, namely index file and data file.
>>The index file stores the keys and it could point to corresponding record
>>stored in the data file.  What I think is that when intermediate result
>>(even only part of it for each mapper) is shuffled to reducer, it is still
>>kept in map file. If so, in order to efficiently sort the data, reducer
>>actually only read the index part of each spill (which is a map file) and
>>sort the keys, instead of reading whole records from disk and sort them.
>>
>>Does reducer actually do as what I expect?
>>
>>-Gang
>>
>>
>

With 0.20 and the TotalOrderPartitioner, isn't reduce-side sorting
possible now? Is that support we can/should add to Hive?
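
For readers unfamiliar with the idea behind a total-order partitioner: keys are routed to reducers by range, so that each reducer sorts only its own range and the reducer outputs, read in order, form one globally sorted sequence. The following is a minimal plain-Python sketch of that idea, not Hadoop code; the function names and the hand-picked split points are assumptions made for illustration (in a real job the split points would typically come from sampling the input).

```python
from bisect import bisect_right


def range_partition(key, split_points):
    """Return the reducer index for `key`, given sorted split points.

    Keys below split_points[0] go to reducer 0, keys between
    split_points[0] and split_points[1] go to reducer 1, and so on.
    """
    return bisect_right(split_points, key)


def total_order_sort(keys, split_points):
    """Simulate a job whose reducer outputs are globally sorted when concatenated."""
    num_reducers = len(split_points) + 1
    partitions = [[] for _ in range(num_reducers)]
    for key in keys:
        partitions[range_partition(key, split_points)].append(key)
    # Each "reducer" sorts only the keys in its own range.
    return [sorted(partition) for partition in partitions]


keys = ["pear", "apple", "zebra", "mango", "kiwi", "fig"]
parts = total_order_sort(keys, split_points=["g", "n"])
# parts == [["apple", "fig"], ["kiwi", "mango"], ["pear", "zebra"]]
```

Because the partitions cover disjoint, ordered key ranges, concatenating the per-reducer results yields the same order as sorting all keys at once.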

RE: sort at reduce side

Posted by Srigurunath Chakravarthi <sr...@yahoo-inc.com>.
Hi Gang,

>kept in map file. If so, in order to efficiently sort the data, reducer
>actually only read the index part of each spill (which is a map file) and
>sort the keys, instead of reading whole records from disk and sort them. 

AFAIK, no. Reducers always fetch the map output data itself, not the indexes (even if the data is on the local node, where an index might be sufficient).
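
To illustrate the point in general terms: since each map output is already sorted by key, the reduce side can merge the fetched runs of full records rather than re-sorting from scratch. Below is a minimal plain-Python sketch of such a k-way merge followed by grouping by key. It is not Hadoop's implementation; the function names and the sample records are assumptions made for illustration.

```python
import heapq


def merge_map_outputs(sorted_runs):
    """Merge several key-sorted runs of (key, value) records into one sorted list."""
    return list(heapq.merge(*sorted_runs, key=lambda record: record[0]))


def group_by_key(merged_records):
    """Group consecutive records that share a key, as a reduce call would see them."""
    groups = []
    for key, value in merged_records:
        if groups and groups[-1][0] == key:
            groups[-1][1].append(value)
        else:
            groups.append((key, [value]))
    return groups


# Two hypothetical map outputs, each already sorted by key.
run_1 = [("a", 1), ("c", 1)]
run_2 = [("a", 2), ("b", 5)]
merged = merge_map_outputs([run_1, run_2])
grouped = group_by_key(merged)
# grouped == [("a", [1, 2]), ("b", [5]), ("c", [1])]
```

Note that the merge works on whole records (key and value together), which matches the statement above that the data, not just an index, is fetched.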

Regards,
Sriguru

>-----Original Message-----
>From: Gang Luo [mailto:lgpublic@yahoo.com.cn]
>Sent: Wednesday, February 03, 2010 10:40 AM
>To: common-user@hadoop.apache.org
>Subject: sort at reduce side
>
>Hi all,
>I want to know some more details about the sorting at the reduce side.
>
>The intermediate result generated at the map side is stored as map file
>which actually consists of two sub-files, namely index file and data file.
>The index file stores the keys and it could point to corresponding record
>stored in the data file.  What I think is that when intermediate result
>(even only part of it for each mapper) is shuffled to reducer, it is still
>kept in map file. If so, in order to efficiently sort the data, reducer
>actually only read the index part of each spill (which is a map file) and
>sort the keys, instead of reading whole records from disk and sort them.
>
>Does reducer actually do as what I expect?
>
>-Gang
>
>

sort at reduce side

Posted by Gang Luo <lg...@yahoo.com.cn>.
Hi all,
I would like to know more about the sorting on the reduce side.

The intermediate result generated on the map side is stored as a map file, which actually consists of two sub-files: an index file and a data file. The index file stores the keys and can point to the corresponding records stored in the data file. My understanding is that when the intermediate result (or even just a part of it, for each mapper) is shuffled to the reducer, it is still kept as a map file. If so, to sort the data efficiently, the reducer would only need to read the index part of each spill (which is a map file) and sort the keys, instead of reading whole records from disk and sorting them.

Does the reducer actually work the way I expect?

-Gang


      ___________________________________________________________ 
  Fun greeting cards are waiting for you to send; the all-new mailbox greeting card service is now online!
http://card.mail.cn.yahoo.com/

Re: help for hadoop begginer

Posted by Kay Kay <ka...@gmail.com>.
On 2/2/10 9:02 AM, zaki rahaman wrote:
> Most of your questions are easily answered by taking a look at the
> documentation, FAQs, and some smart Googling/Yahooing/Binging.
>
>
> 1. The main Hadoop project consists of two major components: HDFS (Hadoop
> Distributed File System) and MapReduce.
>
> 2. Not sure what you mean by your second question.
>
> 3. MapReduce is simply a framework for doing distributed computation. And
> again, I don't understand what you mean by your "clubbing "Cloud
> Computing,Hadoop and
> Webhosting"......please this is really important"
>
> 4. From my understanding, Hbase and Hypertable are two different
> implementations/approaches to solving the problem of having a low-latency
> distributed table system similar to BigTable. One major difference is that
> Hbase is implemented in Java and built to work on top of HDFS. I don't know
> too much about Hypertable other than it's written in C++.
>
> 5. Again, Google is your friend. Most map/reduce implementations are pretty
> general purpose data processing tasks (aggregations, sorting, filters, etc.)
> that aren't specific to search.
>
> 6. See #1. MapReduce is one of the major components of the Hadoop Project
> and the solution to doing a lot of the data processing.
>
> 7. Yes, of course. THe VMware image that Cloudera provides is a good place
> to start. I would also watch their videos and presentations.
>
> 8. Again, I am not all that familiar with HBase but my understanding is that
> this is handled by the regionserver/ZooKeeper setup although I do not know
> the details.
>
> On Tue, Feb 2, 2010 at 11:43 AM, nijil<no...@yahoo.co.in>  wrote:
>
>    
>> i have read about basic stuff about hadoop..err i have a few doubts...mind
>> u
>> am a begginer
>>
>> 1:so is hadoop a file sytem only?
>>
>> 2:can hbase be used instead of other databases in other platforms(eg java)?
>>      
For more HBase-related questions, please post to
hbase-user@hadoop.apache.org.

To add to what zaki mentioned: the Hadoop project consists of the
data structure (HDFS, similar to GFS) and the algorithm (MapReduce,
based on the publicly available paper by Dean and Ghemawat).

HBase, Hypertable, etc. fall in the realm of column-oriented
databases. While it is tempting to treat HBase as a replacement for
MySQL, the analogy is misleading: HBase is specifically meant for
large-scale data processing with high transaction volumes and a
hands-free architecture. Using it involves unlearning a lot of concepts
from the RDBMS world in order to gain better throughput. HBase itself
depends on a distributed file system implementation, in practice
primarily HDFS (although in theory it is possible to plug in other DFS
implementations as well).

HBase concerns itself only with the structured data representation and
its failover mechanisms, while delegating the storage of the
actual data to a distributed file system (HDFS, say).

To answer #8: refer to the Paxos algorithm for the theory, and to
ZooKeeper, which implements a variation of it and is used by HBase.

>> 3:what is mapreduce exactly and hw is it related to hadoop(i mean is it
>> only
>> about parallel computing.....i dont understand how much paralell computing
>> is possible in a hadoop cloud sytem which is use for webhosting) .I require
>> some help on the topic on clubbing "Cloud Computing,Hadoop and
>> Webhosting"......please this is really important
>>
>>
>>
>> 4:Is hbase and hypertabe similar or is there a big difference
>>      

They are 2 different approaches to solving the same problem.

>> 5:Can some one provide a map reduce implementation example other than
>> related to search engine.
>>      
Go through Amdahl's law to get an estimate (a theoretical one at least,
to begin with) of the parallelism available in the code. M-R as a concept
can be applied to the same.
Look for an example where the NY Times applied M-R on EC2 to do some
data-intensive jobs.
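
As a small worked illustration of Amdahl's law: if a fraction p of a job can be parallelized across n workers, the theoretical speedup is 1 / ((1 - p) + p / n). The sketch below just evaluates that formula; the function name and the sample numbers are chosen for illustration only.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Theoretical speedup when `parallel_fraction` of the work runs on `workers` nodes."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


# With 90% of the work parallelizable, 10 workers give roughly a 5.26x speedup,
# and no number of workers can push the speedup past 10x.
speedup_10 = amdahl_speedup(0.9, 10)
speedup_many = amdahl_speedup(0.9, 10**9)
```

The serial fraction sets the ceiling: with p = 0.9 the limit is 1 / 0.1 = 10, however many nodes are added.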
>> 6:How is mapreduce and hadoop related?
>>      

There is a misconception about the Hadoop terminology. A better question
would be: how are MapReduce and HDFS related? The former is an
algorithm, and the latter can be thought of as a data structure that
supports and complements the algorithm. Of course, these
are grossly oversimplified definitions, with the details spared here,
but they help with the terminology.

Hadoop, in the original version of the project, had both of them packed
together in a single distro, making the line blurry. But there are
efforts underway to separate them conceptually into different trees
while maintaining the orthogonality between them. So Hadoop refers to
the ecosystem as a whole, with specific modules addressing specific
problems in the ecosystem.


>> 7:Can i learn hadoop with a "cloudera's Distribution for Hadoop" vmware
>> image..........
>>
>> 8:how is database synchronization done in hbase.....i belive hbase is a
>> distributed database
>>
>> :can some one provide contact details for further help if u dont
>> mind........... :)
>>      
Mailing lists are your friend. Of course, feel free to use them after
doing the appropriate homework. Good luck!


>> --
>> View this message in context:
>> http://old.nabble.com/help-for-hadoop-begginer-tp27423435p27423435.html
>> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>>
>>
>>      
>
>    


Re: help for hadoop begginer

Posted by zaki rahaman <za...@gmail.com>.
Most of your questions are easily answered by taking a look at the
documentation, FAQs, and some smart Googling/Yahooing/Binging.


1. The main Hadoop project consists of two major components: HDFS (Hadoop
Distributed File System) and MapReduce.

2. Not sure what you mean by your second question.

3. MapReduce is simply a framework for doing distributed computation. And
again, I don't understand what you mean by clubbing "Cloud Computing,
Hadoop and Webhosting".

4. From my understanding, HBase and Hypertable are two different
implementations/approaches to solving the problem of having a low-latency
distributed table system similar to BigTable. One major difference is that
HBase is implemented in Java and built to work on top of HDFS. I don't know
too much about Hypertable other than it's written in C++.

5. Again, Google is your friend. Most map/reduce implementations are pretty
general purpose data processing tasks (aggregations, sorting, filters, etc.)
that aren't specific to search.

6. See #1. MapReduce is one of the major components of the Hadoop Project
and the solution to doing a lot of the data processing.

7. Yes, of course. The VMware image that Cloudera provides is a good place
to start. I would also watch their videos and presentations.

8. Again, I am not all that familiar with HBase but my understanding is that
this is handled by the regionserver/ZooKeeper setup although I do not know
the details.
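
To make point 5 a little more concrete, here is a minimal sketch of a non-search aggregation expressed in the map/reduce style: finding the maximum temperature per year. It is a plain-Python simulation of the map, shuffle, and reduce phases, not Hadoop API code; the "year,temperature" input format, the function names, and the sample data are all assumptions made for illustration.

```python
from collections import defaultdict


def map_fn(line):
    """Map phase: turn one 'year,temperature' line into a (year, temperature) pair."""
    year, temperature = line.split(",")
    yield year, int(temperature)


def reduce_fn(year, temperatures):
    """Reduce phase: collapse all temperatures for one year into their maximum."""
    return year, max(temperatures)


def run_job(lines):
    """Run map, then group values by key (the shuffle), then reduce each group."""
    shuffled = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            shuffled[key].append(value)
    return [reduce_fn(key, shuffled[key]) for key in sorted(shuffled)]


readings = ["1949,111", "1950,0", "1949,78", "1950,22", "1950,-11"]
result = run_job(readings)
# result == [("1949", 111), ("1950", 22)]
```

Swapping `max` for a sum, a count, or a filter gives the other kinds of general-purpose jobs mentioned above.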

On Tue, Feb 2, 2010 at 11:43 AM, nijil <no...@yahoo.co.in> wrote:

>
> i have read about basic stuff about hadoop..err i have a few doubts...mind
> u
> am a begginer
>
> 1:so is hadoop a file sytem only?
>
> 2:can hbase be used instead of other databases in other platforms(eg java)?
>
> 3:what is mapreduce exactly and hw is it related to hadoop(i mean is it
> only
> about parallel computing.....i dont understand how much paralell computing
> is possible in a hadoop cloud sytem which is use for webhosting) .I require
> some help on the topic on clubbing "Cloud Computing,Hadoop and
> Webhosting"......please this is really important
>
>
>
> 4:Is hbase and hypertabe similar or is there a big difference
>
> 5:Can some one provide a map reduce implementation example other than
> related to search engine.
>
> 6:How is mapreduce and hadoop related?
>
> 7:Can i learn hadoop with a "cloudera's Distribution for Hadoop" vmware
> image..........
>
> 8:how is database synchronization done in hbase.....i belive hbase is a
> distributed database
>
> :can some one provide contact details for further help if u dont
> mind........... :)
> --
> View this message in context:
> http://old.nabble.com/help-for-hadoop-begginer-tp27423435p27423435.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
>


-- 
Zaki Rahaman