Posted to common-user@hadoop.apache.org by Ricky Ho <rh...@adobe.com> on 2008/11/27 18:37:04 UTC

Best practices for using Hadoop

I am trying to get some answers to these kinds of questions, as they pop up frequently ...

1) What kinds of problems fit Hadoop best, and which do not?

2) What is the dark side of Hadoop, i.e. where do other parallel processing models (e.g. MPI, TupleSpace, etc.) fit better?

3) What is the demarcation point between choosing a Hadoop model and a multi-threaded, shared-memory model?

4) Given that we can partition and replicate an RDBMS table, making it as big as we like and spreading the workload across nodes, why isn't that good enough for scalability?  Why do we need BigTable or HBase, which require adopting a new data model?

5) Is there a general methodology for transforming any algorithm into the map/reduce form?
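
To make question 5 concrete, the shape I have in mind is the classic word count expressed as a map function and a reduce function.  This is only a minimal Python sketch of the programming model (the names map_fn, reduce_fn and run are mine, not any Hadoop API):

from itertools import groupby
from operator import itemgetter

def map_fn(line):
    # Map step: emit a (word, 1) pair for every word in one input record.
    for word in line.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Reduce step: fold all partial counts for one key into a total.
    yield (word, sum(counts))

def run(lines):
    # Simulate what the framework does: apply map, group pairs by key,
    # then apply reduce to each group.
    pairs = sorted((kv for line in lines for kv in map_fn(line)),
                   key=itemgetter(0))
    results = []
    for word, group in groupby(pairs, key=itemgetter(0)):
        results.extend(reduce_fn(word, [count for _, count in group]))
    return results

print(run(["the quick brown fox", "the lazy dog jumps over the fox"]))

The general-methodology question is whether any algorithm can be cast into this emit-pairs / group-by-key / fold-per-key shape, or only some.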

6) How would one choose between Hadoop Java, Hadoop Streaming and PIG?  It looks like if a problem can be solved in one, it can be solved in the others.  If so, PIG is more attractive because it gives higher-level semantics.
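
For question 6, the Streaming variant of the same word count is what makes me think the three options are interchangeable.  A minimal sketch, assuming the usual Streaming contract (records arrive on stdin, the mapper writes tab-separated key/value lines to stdout, and the reducer sees its input sorted by key); the file names mapper.py and reducer.py are just placeholders:

#!/usr/bin/env python
# mapper.py -- emit "word<TAB>1" for every word read from stdin.
import sys

for line in sys.stdin:
    for word in line.split():
        print("%s\t%d" % (word, 1))

#!/usr/bin/env python
# reducer.py -- input is sorted by key, so all counts for a word are adjacent.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word != current_word:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, 0
    current_count += int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))

Roughly the same job in Hadoop Java means writing Mapper and Reducer classes plus a driver, and in PIG it is a few lines of LOAD / GROUP / FOREACH, which is why the choice seems to come down to convenience rather than capability.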

I would appreciate it if anyone who has come across these decisions could share their thoughts.

Rgds,
ricky