Posted to user@hadoop.apache.org by "ados1984@gmail.com" <ad...@gmail.com> on 2014/03/12 20:07:46 UTC

Use Cases for Structured Data

Hello Team,

I am starting off on the Hadoop ecosystem and wanted to learn first, based on
my use case, whether Hadoop is the right tool for me.

I have only structured data, and my goal is to save this data into Hadoop
and take advantage of the replication factor. I am using Microsoft tools for
analysis; they provide good drag-and-drop functionality for creating
different kinds of analyses and also have Hadoop drivers, so they can use
Hadoop as a data source.

My question is: what benefits does the YARN architecture give me in terms of
analysis that my Microsoft, Netezza, or Tableau products do not? I am just
trying to understand the value of introducing Hadoop into my architecture
for analysis, apart from data replication. Any insights would be very
helpful.

Also, my goal for the POC is efficient data storage and retrieval, so:

   1. How does data retrieval work in Hadoop?
   2. Do I always need some kind of data store on top of HDFS, such as
   HBase/Cassandra/MongoDB, or is there no need for one? Can I store all my
   data directly in HDFS and retrieve it when needed using different
   analytic tools that support HDFS as a data source?
   3. Say I have a 3-node cluster, one master and 2 slaves, and I am
   inserting data into Hadoop. What cycle does the framework perform to
   write my data into HDFS? Does my process read all the metadata from the
   master node about where the slave nodes are and what data should go on
   which slave node, or is all the data sent to the master node, which then
   uses the metadata to decide what portion of the data goes to which node?
   4. If I have a 3-node cluster with 1 master and 2 slaves, my data is
   equally distributed across the two nodes, and the replication factor is
   set to 2, then where and how will replication take place, since I do not
   have any vacant node for replication?
   5. For the POC, does it make sense to go with a Cloudera 3-node free
   cluster or a Hortonworks 3-node free cluster, or with the open-source
   Hadoop version? If we go with open-source Hadoop, where do we define
   which is the master node and which are the slave nodes, and can we run
   all 3 nodes on the same machine, or do they need to be on different
   machines?
   6. What are the pros and cons of going with Hortonworks/Cloudera as
   opposed to Apache Hadoop, from an initial POC point of view?
   7. If we go with Hortonworks/Cloudera, which tools come bundled with the
   Hadoop framework? And if we go with Apache Hadoop, do tools like Pig and
   Hive come bundled, or do we have to install them separately?

Since I am just starting my Hadoop journey, I would really appreciate it if
the community could point me in the right direction.

Regards, Andy.

Re: Use Cases for Structured Data

Posted by "ados1984@gmail.com" <ad...@gmail.com>.
Thanks D, that certainly answers my question.

I was just taking a quick look at Hortonworks HDP vs. the Hortonworks
Sandbox; do you know of any benefits of using the Sandbox as opposed to the
Hortonworks Data Platform?


On Wed, Mar 12, 2014 at 4:02 PM, Dieter De Witte <dr...@gmail.com> wrote:

> Hi,
>
> 1) HDFS is just a file system; it hides the fact that it is distributed.
> 2) MapReduce is the most low-level analytics tool, I think: you just
> specify an input and define, in your map and reduce functions, some
> functionality to deal with that input. No need for HBase, etc., although
> they can be extremely useful.
> 3) This is all in the Hadoop reference: first the namenode finds a place
> to allocate your data, then it gets copied to the corresponding datanode 1,
> and from datanode 1 it is copied to datanode 2 (note the numbers have no
> special meaning).
> 4) Your data will be on both datanodes. Why would that be a problem?
> 5) For a proof of concept I would use a ready-made virtual machine from
> one of the three big vendors: Cloudera, MapR, or Hortonworks.
> 6) The Apache version is more basic; the commercial distributions have
> more built-in features and are easier to work with, I guess.
> 7) You have to install them separately; that is maybe the main reason to
> go with one of the vendors.
>
> You should definitely have a look at the reference. You don't have to read
> it from A to Z, but it contains sections where every single sentence will
> answer one of your questions.
>
> Regards, D
>
>
>
> 2014-03-12 20:37 GMT+01:00 ados1984@gmail.com <ad...@gmail.com>:
>
> Thank you Shahab, but it would be really nice if I could get some input on my
>> initial question, as it would really help.
>>
>>
>> On Wed, Mar 12, 2014 at 3:11 PM, Shahab Yunus <sh...@gmail.com> wrote:
>>
>>> I would suggest that given the level of details that you are looking for
>>> and fundamental nature of your questions, you should get hold of books or
>>> online documentation. Basically some reading/research.
>>>
>>> The latest edition of
>>> http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449311520 is
>>> highly recommended to begin with.
>>>
>>> Regards,
>>> Shahab
>>>
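Dieter's point 2 describes the MapReduce model: a map function emits key/value
pairs from the input, the framework groups them by key, and a reduce function
aggregates each group. A toy word-count sketch of that model in plain Python
(illustrative only; the actual Hadoop API is Java-based and runs the same
phases distributed across the cluster):

```python
from collections import defaultdict

def map_phase(line):
    # Emit (word, 1) pairs, like a Hadoop Mapper would.
    for word in line.split():
        yield word.lower(), 1

def reduce_phase(word, counts):
    # Sum all counts for one key, like a Hadoop Reducer would.
    return word, sum(counts)

def run_job(lines):
    # Shuffle step: group intermediate values by key before reducing.
    groups = defaultdict(list)
    for line in lines:
        for word, count in map_phase(line):
            groups[word].append(count)
    return dict(reduce_phase(w, c) for w, c in groups.items())

result = run_job(["structured data in Hadoop", "structured data everywhere"])
print(result)  # {'structured': 2, 'data': 2, 'in': 1, 'hadoop': 1, 'everywhere': 1}
```

In real Hadoop the map and shuffle work happens in parallel on the datanodes
holding the input blocks; this single-process version only shows the shape of
the computation.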

Re: Use Cases for Structured Data

Posted by "ados1984@gmail.com" <ad...@gmail.com>.
Okay, thank you D, I will start playing around with the Sandbox version.




On Thu, Mar 13, 2014 at 5:55 AM, Dieter De Witte <dr...@gmail.com> wrote:

> The Sandbox is just meant to be a learning environment, I guess: to see
> what's possible and how things can be connected. The real distribution will
> have much higher performance and is the one you need when you want to
> investigate performance issues. The only real drawback of the real
> distributions is that they take more time to get started with when you
> sometimes just want to play around.
>
>
> 2014-03-12 21:23 GMT+01:00 ados1984@gmail.com <ad...@gmail.com>:
>
> Hey D,
>> Regarding your point 5: "For a proof of concept I would use a ready-made
>> virtual machine from one of the three big vendors - cloudera, mapR and
>> hortonworks"
>>
>> I want to understand how this virtual setup would work, how many master and
>> slave nodes I can have in it, and in general what the differences are
>> between an actual Hadoop distribution and these ready-made virtual setups.
>>
>> Regards, Andy.
>>
>>
>>
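Dieter's points 3 and 4 earlier in the thread (the namenode chooses target
datanodes for each block, the block is pipelined from datanode 1 to datanode
2, and with replication factor 2 on 2 datanodes every block simply lands on
both) can be sketched as a toy simulation in Python. This only illustrates
the flow; it is not real HDFS code, and the round-robin placement here is a
made-up stand-in for HDFS's actual placement policy:

```python
import itertools

class NameNode:
    """Toy 'master': holds metadata about datanodes and block placement."""
    def __init__(self, datanode_names, replication=2):
        self.datanodes = datanode_names
        self.replication = replication
        self._next = itertools.cycle(range(len(datanode_names)))

    def choose_targets(self):
        # Round-robin stand-in for HDFS's real block placement policy.
        start = next(self._next)
        n = len(self.datanodes)
        return [self.datanodes[(start + i) % n] for i in range(self.replication)]

def write_file(namenode, blocks):
    # storage maps datanode name -> set of blocks it ends up holding.
    storage = {name: set() for name in namenode.datanodes}
    for block in blocks:
        targets = namenode.choose_targets()  # metadata lookup on the master
        for name in targets:                 # write pipeline: target 1 -> target 2
            storage[name].add(block)
    return storage

nn = NameNode(["datanode-1", "datanode-2"], replication=2)
storage = write_file(nn, ["block-0", "block-1"])
print(storage)  # with 2 datanodes and replication 2, both nodes hold every block
```

The point for question 4: replication places copies on *different existing
datanodes*, so no vacant third node is needed; replication 2 on 2 datanodes
just means each block lives on both.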

Re: Use Cases for Structured Data

Posted by "ados1984@gmail.com" <ad...@gmail.com>.
okies, thank you D, i will start playing around with the Sandbox version.




On Thu, Mar 13, 2014 at 5:55 AM, Dieter De Witte <dr...@gmail.com> wrote:

> Sandbox is just meant to be a learning environment i guess, to see what's
> possible, how things can be connected. The real distribution will have much
> higher performance and is the one you need when you want to investigate
> performance issues. The only real drawback of the real distributions is
> that they take more time to get you started when you sometimes just want to
> play around..
>
>
> 2014-03-12 21:23 GMT+01:00 ados1984@gmail.com <ad...@gmail.com>:
>
> Hey D,
>> Regarding your point 5: "For a proof of concept I would use a ready-made
>> virtual machine from one to 3 big vendors - cloudera, mapR and hortonworks"
>>
>> I want to understand how this virtual setup would work and how much
>> master and slaves nodes I can have in this virtual setup and in general
>> what are differences between the actual Hadoop Distribution to this virtual
>> ready made setups?
>>
>> Regards, Andy.
>>
>>
>>
>> On Wed, Mar 12, 2014 at 4:02 PM, Dieter De Witte <dr...@gmail.com>wrote:
>>
>>> Hi,
>>>
>>> 1) HDFS is just a file system, it hides the fact that it is distributed.
>>> 2) Mapreduce is the most lowlevel analytics tool I think, you can just
>>> specify an input and in your map and reduce function define some
>>> functionality to deal with this input. No need for HBase,... although they
>>> can be extremely useful..
>>> 3) this is all in the hadoop reference: first the namenode finds a place
>>> to allocate your data, then it gets copied to the corresponding datanode 1,
>>> and from datanode 1 it is copied to datanode 2 (note the numbers have no
>>> special meaning)
>>> 4) Your data will be on both datanodes. Why would that be a problem?
>>> 5) For a proof of concept I would use a ready-made virtual machine from
>>> one of the three big vendors: cloudera, mapR or hortonworks
>>> 6) Apache version is more basic, the commercial distributions have more
>>> built-in features, are easier to work with I guess
>>> 7) You have to install them seperately, the main reason to go for one of
>>> the vendors maybe?
>>>
>>> You should defintely have a look at the reference, you don't have to
>>> read it from A-Z but it contains sections where every single sentence will
>>> answer one of your questions..
>>>
>>> Regards, D
>>>
>>>
>>>
>>> 2014-03-12 20:37 GMT+01:00 ados1984@gmail.com <ad...@gmail.com>:
>>>
>>> Thank you Shahab but it would be really nice if I can get some input on
>>>> my initial question as it would really help.
>>>>
>>>>
>>>> On Wed, Mar 12, 2014 at 3:11 PM, Shahab Yunus <sh...@gmail.com>wrote:
>>>>
>>>>> I would suggest that given the level of details that you are looking
>>>>> for and fundamental nature of your questions, you should get hold of books
>>>>> or online documentation. Basically some reading/research.
>>>>>
>>>>> Latest edition of
>>>>> http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449311520is highly recommended to begin with.
>>>>>
>>>>> Regards,
>>>>> Shahab
>>>>>
>>>>>
>>>>> On Wed, Mar 12, 2014 at 3:07 PM, ados1984@gmail.com <
>>>>> ados1984@gmail.com> wrote:
>>>>>
>>>>>> Hello Team,
>>>>>>
>>>>>> I am starting off on Hadoop eco-system and wanted to learn first
>>>>>> based on my use case if Hadoop is right tool for me.
>>>>>>
>>>>>> I have only structured data and my goal is to safe this data into
>>>>>> Hadoop and take benefit of replication factor. I am using Microsoft tools
>>>>>> for doing analysis and it provides me with good drag and drop functionality
>>>>>> for creating different kind of anaylsis and also it has hadoop drivers so
>>>>>> it can have hadoop as data source for doing analysis.
>>>>>>
>>>>>> My question here is how benefits YARN architecture give me in tems of
>>>>>> analysis that my Microsoft, Netezza of Tableau products are not giving me.
>>>>>> I am just trying to understand value of introducing Hadoop in my
>>>>>> Architecture in terms of Analysis apart from data replication. Any insights
>>>>>> would be very helpful.
>>>>>>
>>>>>> Also, my goal for POC is related to efficient data storage/retrieval
>>>>>> and so
>>>>>>
>>>>>>    1. how does data retrieval work in hadoop?
>>>>>>    2. do i always need to have any kind of data source on top of
>>>>>>    hdfs like hbase/cassandra/mongo or there is not need for one and i can have
>>>>>>    all my data stored in hdfs directly and can retrieve them when i need by
>>>>>>    using different analytic tools that have hdfs as data source?
>>>>>>    3. say if i have 3 node cluster, one master and 2 slaves and if
>>>>>>    am trying to insert data into hadoop then what is the cycle that framework
>>>>>>    performs to install my data into hdfs - does my process reads all meta data
>>>>>>    information from master node about where is my slaves nodes and what kind
>>>>>>    of data should go on which slave node or all data is send to master node
>>>>>>    and from there depending upon meta data information it reads and decides
>>>>>>    that what portion of data should be going to which node?
>>>>>>    4. Also if i have 3 node cluster with 1 master and 2 slaves and
>>>>>>    if my data is equally distributed in two nodes and if i have replication
>>>>>>    set to 2 then where and how will replication take place as i do not have
>>>>>>    any node vacant for doing replication?
>>>>>>    5. Also, for POC, does it make sense to go with Cloudera 3 node
>>>>>>    free cluster or Hortonworks 3 node free cluster or it makes sense to go
>>>>>>    with opensource hadoop version and if we go with open source hadoop version
>>>>>>    then where can we define that which is master node and which is slave node
>>>>>>    and also can we have all 3 nodes on same machine or we need to have all 3
>>>>>>    nodes on different machines?
>>>>>>    6. Also, what are the pros and cons with going through
>>>>>>    Hortonworks/Cloudera as opposed to Apache Hadoop from initial POC point of
>>>>>>    view?
>>>>>>    7. Also, if we go with Hortonworks/Cloudera then what all tools
>>>>>>    are come clubbed together with Hadoop framework and if we go with Apache
>>>>>>    Hadoop, do we get any tools like Pig, Hive clubbed together or we have to
>>>>>>    install them separately?
>>>>>>
>>>>>> Since am staring off on Hadoop Journey recently, I would really
>>>>>> appreciate if community can point me in right direction?
>>>>>>
>>>>>> Regards, Andy.
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Use Cases for Structured Data

Posted by "ados1984@gmail.com" <ad...@gmail.com>.
okies, thank you D, i will start playing around with the Sandbox version.




On Thu, Mar 13, 2014 at 5:55 AM, Dieter De Witte <dr...@gmail.com> wrote:

> Sandbox is just meant to be a learning environment i guess, to see what's
> possible, how things can be connected. The real distribution will have much
> higher performance and is the one you need when you want to investigate
> performance issues. The only real drawback of the real distributions is
> that they take more time to get you started when you sometimes just want to
> play around..
>
>
> 2014-03-12 21:23 GMT+01:00 ados1984@gmail.com <ad...@gmail.com>:
>
> Hey D,
>> Regarding your point 5: "For a proof of concept I would use a ready-made
>> virtual machine from one to 3 big vendors - cloudera, mapR and hortonworks"
>>
>> I want to understand how this virtual setup would work and how much
>> master and slaves nodes I can have in this virtual setup and in general
>> what are differences between the actual Hadoop Distribution to this virtual
>> ready made setups?
>>
>> Regards, Andy.
>>
>>
>>
>> On Wed, Mar 12, 2014 at 4:02 PM, Dieter De Witte <dr...@gmail.com>wrote:
>>
>>> Hi,
>>>
>>> 1) HDFS is just a file system, it hides the fact that it is distributed.
>>> 2) Mapreduce is the most lowlevel analytics tool I think, you can just
>>> specify an input and in your map and reduce function define some
>>> functionality to deal with this input. No need for HBase,... although they
>>> can be extremely useful..
>>> 3) this is all in the hadoop reference: first the namenode finds a place
>>> to allocate your data, then it gets copied to the corresponding datanode 1,
>>> and from datanode 1 it is copied to datanode 2 (note the numbers have no
>>> special meaning)
>>> 4) Your data will be on both datanodes. Why would that be a problem?
>>> 5) For a proof of concept I would use a ready-made virtual machine from
>>> one of the three big vendors: cloudera, mapR or hortonworks
>>> 6) Apache version is more basic, the commercial distributions have more
>>> built-in features, are easier to work with I guess
>>> 7) You have to install them seperately, the main reason to go for one of
>>> the vendors maybe?
>>>
>>> You should defintely have a look at the reference, you don't have to
>>> read it from A-Z but it contains sections where every single sentence will
>>> answer one of your questions..
>>>
>>> Regards, D
>>>
>>>
>>>
>>> 2014-03-12 20:37 GMT+01:00 ados1984@gmail.com <ad...@gmail.com>:
>>>
>>> Thank you Shahab but it would be really nice if I can get some input on
>>>> my initial question as it would really help.
>>>>
>>>>
>>>> On Wed, Mar 12, 2014 at 3:11 PM, Shahab Yunus <sh...@gmail.com>wrote:
>>>>
>>>>> I would suggest that given the level of details that you are looking
>>>>> for and fundamental nature of your questions, you should get hold of books
>>>>> or online documentation. Basically some reading/research.
>>>>>
>>>>> Latest edition of
>>>>> http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449311520is highly recommended to begin with.
>>>>>
>>>>> Regards,
>>>>> Shahab
>>>>>
>>>>>
>>>>> On Wed, Mar 12, 2014 at 3:07 PM, ados1984@gmail.com <
>>>>> ados1984@gmail.com> wrote:
>>>>>
>>>>>> Hello Team,
>>>>>>
>>>>>> I am starting off on the Hadoop eco-system and wanted to learn first,
>>>>>> based on my use case, whether Hadoop is the right tool for me.
>>>>>>
>>>>>> I have only structured data, and my goal is to save this data into
>>>>>> Hadoop and take advantage of the replication factor. I am using Microsoft
>>>>>> tools for analysis; they provide good drag-and-drop functionality for
>>>>>> creating different kinds of analyses, and they have Hadoop drivers, so
>>>>>> they can use Hadoop as a data source for analysis.
>>>>>>
>>>>>> My question here is: what benefits does the YARN architecture give me
>>>>>> in terms of analysis that my Microsoft, Netezza, or Tableau products do
>>>>>> not? I am just trying to understand the value of introducing Hadoop into
>>>>>> my architecture for analysis, apart from data replication. Any insights
>>>>>> would be very helpful.
>>>>>>
>>>>>> Also, my goal for the POC is efficient data storage/retrieval, and so:
>>>>>>
>>>>>>    1. How does data retrieval work in Hadoop?
>>>>>>    2. Do I always need some kind of data store on top of HDFS, such as
>>>>>>    HBase/Cassandra/Mongo, or is there no need for one? Can I store all my
>>>>>>    data in HDFS directly and retrieve it when needed using the various
>>>>>>    analytic tools that support HDFS as a data source?
>>>>>>    3. Say I have a 3-node cluster, one master and 2 slaves, and I am
>>>>>>    inserting data into Hadoop. What is the cycle the framework performs
>>>>>>    to put my data into HDFS? Does my process read metadata from the
>>>>>>    master node about which slave nodes exist and which data should go to
>>>>>>    which slave node, or is all the data sent to the master node, which
>>>>>>    then, based on its metadata, decides which portion of the data goes
>>>>>>    to which node?
>>>>>>    4. Also, if I have a 3-node cluster with 1 master and 2 slaves, my
>>>>>>    data is equally distributed across the two nodes, and replication is
>>>>>>    set to 2, then where and how will replication take place, given that
>>>>>>    I do not have any vacant node for the replicas?
>>>>>>    5. Also, for a POC, does it make sense to go with a Cloudera 3-node
>>>>>>    free cluster or a Hortonworks 3-node free cluster, or with the
>>>>>>    open-source Hadoop version? If we go with open-source Hadoop, where
>>>>>>    do we define which is the master node and which are the slave nodes,
>>>>>>    and can we have all 3 nodes on the same machine, or do they need to
>>>>>>    be on different machines?
>>>>>>    6. Also, what are the pros and cons of going with Hortonworks/Cloudera
>>>>>>    as opposed to Apache Hadoop, from an initial POC point of view?
>>>>>>    7. Also, if we go with Hortonworks/Cloudera, which tools come bundled
>>>>>>    with the Hadoop framework? And if we go with Apache Hadoop, do we get
>>>>>>    tools like Pig and Hive bundled, or do we have to install them
>>>>>>    separately?
>>>>>>
>>>>>> Since I am starting off on my Hadoop journey, I would really
>>>>>> appreciate it if the community could point me in the right direction.
>>>>>>
>>>>>> Regards, Andy.
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
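Questions 3 and 4 above (the write cycle, and replication with only two slave nodes) can be sketched as a toy simulation. This is pure Python with hypothetical names, not the real HDFS API: the client asks the NameNode (master) where each block should go, the data itself flows from the client to datanode 1 and then to datanode 2 without passing through the master, and with two datanodes and replication factor 2 every block simply gets one replica on each node.

```python
# Toy sketch of HDFS block placement (hypothetical names, not the real API).
# With 2 datanodes and replication factor 2, no "vacant" third node is
# needed: each block gets one replica on each datanode.

REPLICATION = 2
DATANODES = ["datanode1", "datanode2"]

def place_blocks(file_size_mb, block_size_mb=128):
    """Return a mapping block_id -> list of datanodes holding a replica."""
    num_blocks = -(-file_size_mb // block_size_mb)  # ceiling division
    placement = {}
    for block_id in range(num_blocks):
        # The NameNode picks REPLICATION distinct datanodes per block;
        # real HDFS also weighs rack topology and free space.
        targets = DATANODES[:REPLICATION]
        placement[block_id] = targets
    return placement

placement = place_blocks(300)  # a 300 MB file splits into 3 blocks
for block_id, nodes in placement.items():
    print(f"block {block_id}: replicas on {nodes}")
```

This only illustrates the shape of the answer to question 4: both datanodes hold a full copy of every block, so replication works with no spare node.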

Re: Use Cases for Structured Data

Posted by "ados1984@gmail.com" <ad...@gmail.com>.
Okay, thank you D, I will start playing around with the Sandbox version.

On Thu, Mar 13, 2014 at 5:55 AM, Dieter De Witte <dr...@gmail.com> wrote:

> The Sandbox is just meant to be a learning environment, I guess: to see
> what's possible and how things can be connected. A real distribution will
> have much higher performance and is the one you need when you want to
> investigate performance issues. The only real drawback of the real
> distributions is that they take more time to get started with when you
> sometimes just want to play around.
>
>
> 2014-03-12 21:23 GMT+01:00 ados1984@gmail.com <ad...@gmail.com>:
>
Hey D,
>> Regarding your point 5: "For a proof of concept I would use a ready-made
>> virtual machine from one of the 3 big vendors - cloudera, mapR and hortonworks"
>>
>> I want to understand how this virtual setup would work, how many master
>> and slave nodes I can have in this virtual setup, and in general what the
>> differences are between an actual Hadoop distribution and these ready-made
>> virtual setups.
>>
>> Regards, Andy.
>>
>>
>>
>> On Wed, Mar 12, 2014 at 4:02 PM, Dieter De Witte <dr...@gmail.com>wrote:
>>
>>> Hi,
>>>
>>> 1) HDFS is just a file system; it hides the fact that it is distributed.
>>> 2) MapReduce is the most low-level analytics tool, I think: you just
>>> specify an input and, in your map and reduce functions, define the logic
>>> to process that input. No need for HBase and friends, although they can
>>> be extremely useful.
>>> 3) This is all in the Hadoop reference: first the namenode finds a place
>>> to allocate your data, then it gets copied to the corresponding datanode 1,
>>> and from datanode 1 it is copied to datanode 2 (note: the numbers have no
>>> special meaning).
>>> 4) Your data will be on both datanodes. Why would that be a problem?
>>> 5) For a proof of concept I would use a ready-made virtual machine from
>>> one of the three big vendors: Cloudera, MapR, or Hortonworks.
>>> 6) The Apache version is more basic; the commercial distributions have
>>> more built-in features and are easier to work with, I guess.
>>> 7) You have to install them separately; perhaps the main reason to go
>>> with one of the vendors?
>>>
>>> You should definitely have a look at the reference. You don't have to
>>> read it from A to Z, but it contains sections where every single sentence
>>> will answer one of your questions.
>>>
>>> Regards, D
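Point 2 above (MapReduce only needs an input plus a map function and a reduce function) can be imitated locally without a cluster. The sketch below is a hypothetical, single-process stand-in for a Hadoop Streaming word count; on a real cluster the mapper and reducer would be separate scripts reading stdin and writing tab-separated lines, with Hadoop handling the shuffle/sort between them.

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Emit (word, 1) for every word, like a streaming mapper
    # printing "word\t1" lines.
    for word in line.split():
        yield (word.lower(), 1)

def reducer(key, values):
    # Receives all values for one key (the shuffle/sort phase
    # guarantees grouping) and emits the aggregate.
    return (key, sum(values))

def run_job(lines):
    # Map phase
    pairs = [kv for line in lines for kv in mapper(line)]
    # Shuffle/sort phase: Hadoop sorts pairs by key between map and reduce
    pairs.sort(key=itemgetter(0))
    # Reduce phase: one reducer call per distinct key
    return dict(reducer(key, (v for _, v in group))
                for key, group in groupby(pairs, key=itemgetter(0)))

counts = run_job(["structured data in Hadoop", "data retrieval in HDFS"])
print(counts)
```

The point is only the shape of the API: all the per-record logic lives in `mapper` and `reducer`, and the framework supplies input splitting, sorting, and grouping.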
>>>
>>>
>>>
>>> 2014-03-12 20:37 GMT+01:00 ados1984@gmail.com <ad...@gmail.com>:
>>>
>>> Thank you Shahab but it would be really nice if I can get some input on
>>>> my initial question as it would really help.
>>>>
>>>>

Re: Use Cases for Structured Data

Posted by Dieter De Witte <dr...@gmail.com>.
The Sandbox is just meant to be a learning environment, I guess: to see
what's possible and how things can be connected. A real distribution will
have much higher performance and is the one you need when you want to
investigate performance issues. The only real drawback of the real
distributions is that they take more time to get started with when you
sometimes just want to play around.



Re: Use Cases for Structured Data

Posted by Dieter De Witte <dr...@gmail.com>.
Sandbox is just meant to be a learning environment i guess, to see what's
possible, how things can be connected. The real distribution will have much
higher performance and is the one you need when you want to investigate
performance issues. The only real drawback of the real distributions is
that they take more time to get you started when you sometimes just want to
play around..


2014-03-12 21:23 GMT+01:00 ados1984@gmail.com <ad...@gmail.com>:

> Hey D,
> Regarding your point 5: "For a proof of concept I would use a ready-made
> virtual machine from one to 3 big vendors - cloudera, mapR and hortonworks"
>
> I want to understand how this virtual setup would work and how much master
> and slaves nodes I can have in this virtual setup and in general what are
> differences between the actual Hadoop Distribution to this virtual ready
> made setups?
>
> Regards, Andy.
>
>
>
> On Wed, Mar 12, 2014 at 4:02 PM, Dieter De Witte <dr...@gmail.com>wrote:
>
>> Hi,
>>
>> 1) HDFS is just a file system, it hides the fact that it is distributed.
>> 2) Mapreduce is the most lowlevel analytics tool I think, you can just
>> specify an input and in your map and reduce function define some
>> functionality to deal with this input. No need for HBase,... although they
>> can be extremely useful..
>> 3) this is all in the hadoop reference: first the namenode finds a place
>> to allocate your data, then it gets copied to the corresponding datanode 1,
>> and from datanode 1 it is copied to datanode 2 (note the numbers have no
>> special meaning)
>> 4) Your data will be on both datanodes. Why would that be a problem?
>> 5) For a proof of concept I would use a ready-made virtual machine from
>> one of the three big vendors: cloudera, mapR or hortonworks
>> 6) Apache version is more basic, the commercial distributions have more
>> built-in features, are easier to work with I guess
>> 7) You have to install them seperately, the main reason to go for one of
>> the vendors maybe?
>>
>> You should defintely have a look at the reference, you don't have to read
>> it from A-Z but it contains sections where every single sentence will
>> answer one of your questions..
>>
>> Regards, D
>>
>>
>>
>> 2014-03-12 20:37 GMT+01:00 ados1984@gmail.com <ad...@gmail.com>:
>>
>> Thank you Shahab but it would be really nice if I can get some input on
>>> my initial question as it would really help.
>>>
>>>
>>> On Wed, Mar 12, 2014 at 3:11 PM, Shahab Yunus <sh...@gmail.com>wrote:
>>>
>>>> I would suggest that given the level of details that you are looking
>>>> for and fundamental nature of your questions, you should get hold of books
>>>> or online documentation. Basically some reading/research.
>>>>
>>>> Latest edition of
>>>> http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449311520is highly recommended to begin with.
>>>>
>>>> Regards,
>>>> Shahab
>>>>
>>>>
>>>> On Wed, Mar 12, 2014 at 3:07 PM, ados1984@gmail.com <ados1984@gmail.com
>>>> > wrote:
>>>>
>>>>> Hello Team,
>>>>>
>>>>> I am starting off on Hadoop eco-system and wanted to learn first based
>>>>> on my use case if Hadoop is right tool for me.
>>>>>
>>>>> I have only structured data and my goal is to safe this data into
>>>>> Hadoop and take benefit of replication factor. I am using Microsoft tools
>>>>> for doing analysis and it provides me with good drag and drop functionality
>>>>> for creating different kind of anaylsis and also it has hadoop drivers so
>>>>> it can have hadoop as data source for doing analysis.
>>>>>
>>>>> My question here is how benefits YARN architecture give me in tems of
>>>>> analysis that my Microsoft, Netezza of Tableau products are not giving me.
>>>>> I am just trying to understand value of introducing Hadoop in my
>>>>> Architecture in terms of Analysis apart from data replication. Any insights
>>>>> would be very helpful.
>>>>>
>>>>> Also, my goal for POC is related to efficient data storage/retrieval
>>>>> and so
>>>>>
>>>>>    1. how does data retrieval work in hadoop?
>>>>>    2. do i always need to have any kind of data source on top of hdfs
>>>>>    like hbase/cassandra/mongo or there is not need for one and i can have all
>>>>>    my data stored in hdfs directly and can retrieve them when i need by using
>>>>>    different analytic tools that have hdfs as data source?
>>>>>    3. say if i have 3 node cluster, one master and 2 slaves and if am
>>>>>    trying to insert data into hadoop then what is the cycle that framework
>>>>>    performs to install my data into hdfs - does my process reads all meta data
>>>>>    information from master node about where is my slaves nodes and what kind
>>>>>    of data should go on which slave node or all data is send to master node
>>>>>    and from there depending upon meta data information it reads and decides
>>>>>    that what portion of data should be going to which node?
>>>>>    4. Also if i have 3 node cluster with 1 master and 2 slaves and if
>>>>>    my data is equally distributed in two nodes and if i have replication set
>>>>>    to 2 then where and how will replication take place as i do not have any
>>>>>    node vacant for doing replication?
>>>>>    5. Also, for POC, does it make sense to go with Cloudera 3 node
>>>>>    free cluster or Hortonworks 3 node free cluster or it makes sense to go
>>>>>    with opensource hadoop version and if we go with open source hadoop version
>>>>>    then where can we define that which is master node and which is slave node
>>>>>    and also can we have all 3 nodes on same machine or we need to have all 3
>>>>>    nodes on different machines?
>>>>>    6. Also, what are the pros and cons with going through
>>>>>    Hortonworks/Cloudera as opposed to Apache Hadoop from initial POC point of
>>>>>    view?
>>>>>    7. Also, if we go with Hortonworks/Cloudera then what all tools
>>>>>    are come clubbed together with Hadoop framework and if we go with Apache
>>>>>    Hadoop, do we get any tools like Pig, Hive clubbed together or we have to
>>>>>    install them separately?
>>>>>
>>>>> Since am staring off on Hadoop Journey recently, I would really
>>>>> appreciate if community can point me in right direction?
>>>>>
>>>>> Regards, Andy.
>>>>>
>>>>
>>>>
>>>
>>
>

Re: Use Cases for Structured Data

Posted by Dieter De Witte <dr...@gmail.com>.
Sandbox is just meant to be a learning environment i guess, to see what's
possible, how things can be connected. The real distribution will have much
higher performance and is the one you need when you want to investigate
performance issues. The only real drawback of the real distributions is
that they take more time to get you started when you sometimes just want to
play around..


2014-03-12 21:23 GMT+01:00 ados1984@gmail.com <ad...@gmail.com>:

> Hey D,
> Regarding your point 5: "For a proof of concept I would use a ready-made
> virtual machine from one to 3 big vendors - cloudera, mapR and hortonworks"
>
> I want to understand how this virtual setup would work and how much master
> and slaves nodes I can have in this virtual setup and in general what are
> differences between the actual Hadoop Distribution to this virtual ready
> made setups?
>
> Regards, Andy.
>
>
>
> On Wed, Mar 12, 2014 at 4:02 PM, Dieter De Witte <dr...@gmail.com>wrote:
>
>> Hi,
>>
>> 1) HDFS is just a file system, it hides the fact that it is distributed.
>> 2) Mapreduce is the most lowlevel analytics tool I think, you can just
>> specify an input and in your map and reduce function define some
>> functionality to deal with this input. No need for HBase,... although they
>> can be extremely useful..
>> 3) this is all in the hadoop reference: first the namenode finds a place
>> to allocate your data, then it gets copied to the corresponding datanode 1,
>> and from datanode 1 it is copied to datanode 2 (note the numbers have no
>> special meaning)
>> 4) Your data will be on both datanodes. Why would that be a problem?
>> 5) For a proof of concept I would use a ready-made virtual machine from
>> one of the three big vendors: cloudera, mapR or hortonworks
>> 6) Apache version is more basic, the commercial distributions have more
>> built-in features, are easier to work with I guess
>> 7) You have to install them seperately, the main reason to go for one of
>> the vendors maybe?
>>
>> You should defintely have a look at the reference, you don't have to read
>> it from A-Z but it contains sections where every single sentence will
>> answer one of your questions..
>>
>> Regards, D
>>
>>
>>
>> 2014-03-12 20:37 GMT+01:00 ados1984@gmail.com <ad...@gmail.com>:
>>
>> Thank you Shahab but it would be really nice if I can get some input on
>>> my initial question as it would really help.
>>>
>>>
>>> On Wed, Mar 12, 2014 at 3:11 PM, Shahab Yunus <sh...@gmail.com>wrote:
>>>
>>>> I would suggest that given the level of details that you are looking
>>>> for and fundamental nature of your questions, you should get hold of books
>>>> or online documentation. Basically some reading/research.
>>>>
>>>> Latest edition of
>>>> http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449311520is highly recommended to begin with.
>>>>
>>>> Regards,
>>>> Shahab
>>>>
>>>>
>>>> On Wed, Mar 12, 2014 at 3:07 PM, ados1984@gmail.com <ados1984@gmail.com
>>>> > wrote:
>>>>
>>>>> Hello Team,
>>>>>
>>>>> I am starting off on Hadoop eco-system and wanted to learn first based
>>>>> on my use case if Hadoop is right tool for me.
>>>>>
>>>>> I have only structured data and my goal is to safe this data into
>>>>> Hadoop and take benefit of replication factor. I am using Microsoft tools
>>>>> for doing analysis and it provides me with good drag and drop functionality
>>>>> for creating different kind of anaylsis and also it has hadoop drivers so
>>>>> it can have hadoop as data source for doing analysis.
>>>>>
>>>>> My question here is how benefits YARN architecture give me in tems of
>>>>> analysis that my Microsoft, Netezza of Tableau products are not giving me.
>>>>> I am just trying to understand value of introducing Hadoop in my
>>>>> Architecture in terms of Analysis apart from data replication. Any insights
>>>>> would be very helpful.
>>>>>
>>>>> Also, my goal for POC is related to efficient data storage/retrieval
>>>>> and so
>>>>>
>>>>>    1. how does data retrieval work in hadoop?
>>>>>    2. do i always need to have any kind of data source on top of hdfs
>>>>>    like hbase/cassandra/mongo or there is not need for one and i can have all
>>>>>    my data stored in hdfs directly and can retrieve them when i need by using
>>>>>    different analytic tools that have hdfs as data source?
>>>>>    3. say if i have 3 node cluster, one master and 2 slaves and if am
>>>>>    trying to insert data into hadoop then what is the cycle that framework
>>>>>    performs to install my data into hdfs - does my process reads all meta data
>>>>>    information from master node about where is my slaves nodes and what kind
>>>>>    of data should go on which slave node or all data is send to master node
>>>>>    and from there depending upon meta data information it reads and decides
>>>>>    that what portion of data should be going to which node?
>>>>>    4. Also if i have 3 node cluster with 1 master and 2 slaves and if
>>>>>    my data is equally distributed in two nodes and if i have replication set
>>>>>    to 2 then where and how will replication take place as i do not have any
>>>>>    node vacant for doing replication?
>>>>>    5. Also, for POC, does it make sense to go with Cloudera 3 node
>>>>>    free cluster or Hortonworks 3 node free cluster or it makes sense to go
>>>>>    with opensource hadoop version and if we go with open source hadoop version
>>>>>    then where can we define that which is master node and which is slave node
>>>>>    and also can we have all 3 nodes on same machine or we need to have all 3
>>>>>    nodes on different machines?
>>>>>    6. Also, what are the pros and cons with going through
>>>>>    Hortonworks/Cloudera as opposed to Apache Hadoop from initial POC point of
>>>>>    view?
>>>>>    7. Also, if we go with Hortonworks/Cloudera then what all tools
>>>>>    are come clubbed together with Hadoop framework and if we go with Apache
>>>>>    Hadoop, do we get any tools like Pig, Hive clubbed together or we have to
>>>>>    install them separately?
>>>>>
>>>>> Since am staring off on Hadoop Journey recently, I would really
>>>>> appreciate if community can point me in right direction?
>>>>>
>>>>> Regards, Andy.
>>>>>
>>>>
>>>>
>>>
>>
>

Re: Use Cases for Structured Data

Posted by Dieter De Witte <dr...@gmail.com>.
Sandbox is just meant to be a learning environment i guess, to see what's
possible, how things can be connected. The real distribution will have much
higher performance and is the one you need when you want to investigate
performance issues. The only real drawback of the real distributions is
that they take more time to get you started when you sometimes just want to
play around..


2014-03-12 21:23 GMT+01:00 ados1984@gmail.com <ad...@gmail.com>:

> Hey D,
> Regarding your point 5: "For a proof of concept I would use a ready-made
> virtual machine from one to 3 big vendors - cloudera, mapR and hortonworks"
>
> I want to understand how this virtual setup would work and how much master
> and slaves nodes I can have in this virtual setup and in general what are
> differences between the actual Hadoop Distribution to this virtual ready
> made setups?
>
> Regards, Andy.
>
>
>
> On Wed, Mar 12, 2014 at 4:02 PM, Dieter De Witte <dr...@gmail.com>wrote:
>
>> Hi,
>>
>> 1) HDFS is just a file system, it hides the fact that it is distributed.
>> 2) Mapreduce is the most lowlevel analytics tool I think, you can just
>> specify an input and in your map and reduce function define some
>> functionality to deal with this input. No need for HBase,... although they
>> can be extremely useful..
>> 3) this is all in the hadoop reference: first the namenode finds a place
>> to allocate your data, then it gets copied to the corresponding datanode 1,
>> and from datanode 1 it is copied to datanode 2 (note the numbers have no
>> special meaning)
>> 4) Your data will be on both datanodes. Why would that be a problem?
>> 5) For a proof of concept I would use a ready-made virtual machine from
>> one of the three big vendors: cloudera, mapR or hortonworks
>> 6) Apache version is more basic, the commercial distributions have more
>> built-in features, are easier to work with I guess
>> 7) You have to install them seperately, the main reason to go for one of
>> the vendors maybe?
>>
>> You should defintely have a look at the reference, you don't have to read
>> it from A-Z but it contains sections where every single sentence will
>> answer one of your questions..
>>
>> Regards, D
>>
>>
>>
>> 2014-03-12 20:37 GMT+01:00 ados1984@gmail.com <ad...@gmail.com>:
>>
>> Thank you Shahab but it would be really nice if I can get some input on
>>> my initial question as it would really help.
>>>
>>>
>>> On Wed, Mar 12, 2014 at 3:11 PM, Shahab Yunus <sh...@gmail.com>wrote:
>>>
>>>> I would suggest that given the level of details that you are looking
>>>> for and fundamental nature of your questions, you should get hold of books
>>>> or online documentation. Basically some reading/research.
>>>>
>>>> Latest edition of
>>>> http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449311520is highly recommended to begin with.
>>>>
>>>> Regards,
>>>> Shahab
>>>>
>>>>
>>>> On Wed, Mar 12, 2014 at 3:07 PM, ados1984@gmail.com <ados1984@gmail.com
>>>> > wrote:
>>>>
>>>>> Hello Team,
>>>>>
>>>>> I am starting off with the Hadoop ecosystem and first wanted to find
>>>>> out, based on my use case, whether Hadoop is the right tool for me.
>>>>>
>>>>> I have only structured data, and my goal is to save this data into
>>>>> Hadoop and benefit from its replication factor. I am using Microsoft tools
>>>>> for analysis; they provide good drag-and-drop functionality for creating
>>>>> different kinds of analyses, and they also have Hadoop drivers, so they
>>>>> can use Hadoop as a data source for analysis.
>>>>>
>>>>> My question here is what benefits the YARN architecture gives me, in
>>>>> terms of analysis, that my Microsoft, Netezza, or Tableau products do not.
>>>>> I am just trying to understand the value of introducing Hadoop into my
>>>>> architecture, in terms of analysis, apart from data replication. Any
>>>>> insights would be very helpful.
>>>>>
>>>>> Also, my goal for the POC is efficient data storage/retrieval, so:
>>>>>
>>>>>    1. How does data retrieval work in Hadoop?
>>>>>    2. Do I always need some kind of data store on top of HDFS, like
>>>>>    HBase/Cassandra/Mongo? Or is there no need for one, so that I can store
>>>>>    all my data in HDFS directly and retrieve it when I need it, using
>>>>>    different analytic tools that have HDFS as a data source?
>>>>>    3. Say I have a 3-node cluster, one master and 2 slaves, and I am
>>>>>    inserting data into Hadoop. What is the cycle the framework performs to
>>>>>    write my data into HDFS? Does my process read all the metadata from the
>>>>>    master node about where my slave nodes are and what kind of data should
>>>>>    go on which slave node, or is all data sent to the master node, which
>>>>>    then, based on the metadata, decides what portion of the data goes to
>>>>>    which node?
>>>>>    4. Also, if I have a 3-node cluster with 1 master and 2 slaves, my
>>>>>    data is equally distributed across the two nodes, and I have replication
>>>>>    set to 2, then where and how will replication take place, since I do
>>>>>    not have any vacant node for replication?
>>>>>    5. Also, for the POC, does it make sense to go with a free Cloudera
>>>>>    3-node cluster or a free Hortonworks 3-node cluster, or to go with the
>>>>>    open-source Hadoop version? If we go with the open-source version, where
>>>>>    do we define which is the master node and which are the slave nodes?
>>>>>    Also, can we have all 3 nodes on the same machine, or do we need all 3
>>>>>    nodes on different machines?
>>>>>    6. Also, what are the pros and cons of going with Hortonworks/Cloudera
>>>>>    as opposed to Apache Hadoop, from an initial POC point of view?
>>>>>    7. Also, if we go with Hortonworks/Cloudera, which tools come bundled
>>>>>    with the Hadoop framework? And if we go with Apache Hadoop, do we get
>>>>>    tools like Pig and Hive bundled, or do we have to install them
>>>>>    separately?
>>>>>
>>>>> Since I am just starting my Hadoop journey, I would really appreciate
>>>>> it if the community could point me in the right direction.
>>>>>
>>>>> Regards, Andy.
>>>>>
>>>>
>>>>
>>>
>>
>
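[Editor's note: questions 3 and 4 above (the write cycle, and replication factor 2 on a cluster with only 2 slaves) can be illustrated with a toy simulation. This is plain Python with invented names, not Hadoop's actual placement code; real HDFS is written in Java, and the client writes each block through a pipeline of datanodes chosen by the namenode.]

```python
import random

class ToyNameNode:
    """Toy model of HDFS block placement (NOT Hadoop's real policy).

    The client asks the namenode where to put each block; the namenode
    returns a pipeline of datanodes. The client streams the block to the
    first datanode, which forwards it to the second, and so on. Data
    never flows through the master itself.
    """

    def __init__(self, datanodes, replication=2):
        self.datanodes = list(datanodes)
        self.replication = replication
        self.block_map = {}  # block id -> datanodes holding a replica

    def allocate(self, block_id):
        # Pick `replication` distinct datanodes for this block.
        pipeline = random.sample(self.datanodes, self.replication)
        self.block_map[block_id] = pipeline
        return pipeline

# A 3-node cluster: 1 master (namenode) and 2 slaves (datanodes).
nn = ToyNameNode(datanodes=["slave1", "slave2"], replication=2)

# Writing a file split into 3 blocks: each block ends up on BOTH
# datanodes, so replication 2 needs no "vacant" third node.
for block in ["blk_1", "blk_2", "blk_3"]:
    print(block, "->", nn.allocate(block))

# Every block is replicated on both slaves.
assert all(set(nodes) == {"slave1", "slave2"}
           for nodes in nn.block_map.values())
```

The point of the sketch: with replication 2 and 2 datanodes, the two replicas of every block simply live on the two slaves; a spare node is only needed for replication factors higher than the datanode count.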

Re: Use Cases for Structured Data

Posted by "ados1984@gmail.com" <ad...@gmail.com>.
Hey D,
Regarding your point 5: "For a proof of concept I would use a ready-made
virtual machine from one of the three big vendors: Cloudera, MapR, or
Hortonworks."

I want to understand how this virtual setup would work, how many master
and slave nodes I can have in it, and in general what the differences are
between an actual Hadoop distribution and these ready-made virtual setups.

Regards, Andy.



On Wed, Mar 12, 2014 at 4:02 PM, Dieter De Witte <dr...@gmail.com> wrote:

> Hi,
>
> 1) HDFS is just a file system; it hides the fact that it is distributed.
> 2) MapReduce is the most low-level analytics tool, I think: you just
> specify an input and, in your map and reduce functions, define how to
> process that input. There is no need for HBase etc., although such tools
> can be extremely useful.
> 3) This is all in the Hadoop reference: first the namenode finds a place
> to allocate your data, then the data is copied to the corresponding
> datanode 1, and from datanode 1 it is copied to datanode 2 (the numbers
> have no special meaning).
> 4) Your data will be on both datanodes. Why would that be a problem?
> 5) For a proof of concept I would use a ready-made virtual machine from
> one of the three big vendors: Cloudera, MapR, or Hortonworks.
> 6) The Apache version is more basic; the commercial distributions have
> more built-in features and are, I would guess, easier to work with.
> 7) You have to install them separately; that is perhaps the main reason
> to go with one of the vendors.
>
> You should definitely have a look at the reference. You don't have to
> read it from A to Z, but it contains sections where every single
> sentence will answer one of your questions.
>
> Regards, D
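[Editor's note: Dieter's point 2, that with MapReduce you only supply a map function and a reduce function, can be sketched in a few lines of plain Python. This mimics the programming model only; Hadoop itself runs the same model in Java, distributed across a cluster by YARN.]

```python
from collections import defaultdict

def map_fn(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.split():
        yield (word.lower(), 1)

def reduce_fn(word, counts):
    """Reduce: sum all the counts emitted for one word."""
    return (word, sum(counts))

def toy_mapreduce(lines):
    """Single-process sketch of the MapReduce model (no cluster, no YARN)."""
    # Shuffle phase: group map output by key.
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)
    # Reduce phase: one reduce call per distinct key.
    return dict(reduce_fn(k, v) for k, v in sorted(groups.items()))

result = toy_mapreduce(["structured data in Hadoop", "data in HDFS"])
print(result)  # {'data': 2, 'hadoop': 1, 'hdfs': 1, 'in': 2, 'structured': 1}
```

Everything outside `map_fn` and `reduce_fn` (splitting the input, shuffling, scheduling) is what the framework provides, which is why no extra store like HBase is strictly required to analyze files sitting in HDFS.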

Re: Use Cases for Structured Data

Posted by "ados1984@gmail.com" <ad...@gmail.com>.
Thanks D, that certainly answers my question.

I was just taking a quick look at Hortonworks HDP versus the Hortonworks
Sandbox; do you know of any benefits of using the Sandbox as opposed to
the Hortonworks Data Platform?



Re: Use Cases for Structured Data

Posted by "ados1984@gmail.com" <ad...@gmail.com>.
Hey D,
Regarding your point 5: "For a proof of concept I would use a ready-made
virtual machine from one to 3 big vendors - cloudera, mapR and hortonworks"

I want to understand how this virtual setup would work and how much master
and slaves nodes I can have in this virtual setup and in general what are
differences between the actual Hadoop Distribution to this virtual ready
made setups?

Regards, Andy.



On Wed, Mar 12, 2014 at 4:02 PM, Dieter De Witte <dr...@gmail.com> wrote:

> Hi,
>
> 1) HDFS is just a file system, it hides the fact that it is distributed.
> 2) Mapreduce is the most lowlevel analytics tool I think, you can just
> specify an input and in your map and reduce function define some
> functionality to deal with this input. No need for HBase,... although they
> can be extremely useful..
> 3) this is all in the hadoop reference: first the namenode finds a place
> to allocate your data, then it gets copied to the corresponding datanode 1,
> and from datanode 1 it is copied to datanode 2 (note the numbers have no
> special meaning)
> 4) Your data will be on both datanodes. Why would that be a problem?
> 5) For a proof of concept I would use a ready-made virtual machine from
> one of the three big vendors: cloudera, mapR or hortonworks
> 6) Apache version is more basic, the commercial distributions have more
> built-in features, are easier to work with I guess
> 7) You have to install them seperately, the main reason to go for one of
> the vendors maybe?
>
> You should defintely have a look at the reference, you don't have to read
> it from A-Z but it contains sections where every single sentence will
> answer one of your questions..
>
> Regards, D
>
>
>
> 2014-03-12 20:37 GMT+01:00 ados1984@gmail.com <ad...@gmail.com>:
>
> Thank you Shahab but it would be really nice if I can get some input on my
>> initial question as it would really help.
>>
>>
>> On Wed, Mar 12, 2014 at 3:11 PM, Shahab Yunus <sh...@gmail.com>wrote:
>>
>>> I would suggest that given the level of details that you are looking for
>>> and fundamental nature of your questions, you should get hold of books or
>>> online documentation. Basically some reading/research.
>>>
>>> Latest edition of
>>> http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449311520is highly recommended to begin with.
>>>
>>> Regards,
>>> Shahab
>>>
>>>
>>> On Wed, Mar 12, 2014 at 3:07 PM, ados1984@gmail.com <ad...@gmail.com>wrote:
>>>
>>>> Hello Team,
>>>>
>>>> I am starting off on Hadoop eco-system and wanted to learn first based
>>>> on my use case if Hadoop is right tool for me.
>>>>
>>>> I have only structured data and my goal is to safe this data into
>>>> Hadoop and take benefit of replication factor. I am using Microsoft tools
>>>> for doing analysis and it provides me with good drag and drop functionality
>>>> for creating different kind of anaylsis and also it has hadoop drivers so
>>>> it can have hadoop as data source for doing analysis.
>>>>
>>>> My question here is how benefits YARN architecture give me in tems of
>>>> analysis that my Microsoft, Netezza of Tableau products are not giving me.
>>>> I am just trying to understand value of introducing Hadoop in my
>>>> Architecture in terms of Analysis apart from data replication. Any insights
>>>> would be very helpful.
>>>>
>>>> Also, my goal for POC is related to efficient data storage/retrieval
>>>> and so
>>>>
>>>>    1. how does data retrieval work in hadoop?
>>>>    2. do i always need to have any kind of data source on top of hdfs
>>>>    like hbase/cassandra/mongo or there is not need for one and i can have all
>>>>    my data stored in hdfs directly and can retrieve them when i need by using
>>>>    different analytic tools that have hdfs as data source?
>>>>    3. say if i have 3 node cluster, one master and 2 slaves and if am
>>>>    trying to insert data into hadoop then what is the cycle that framework
>>>>    performs to install my data into hdfs - does my process reads all meta data
>>>>    information from master node about where is my slaves nodes and what kind
>>>>    of data should go on which slave node or all data is send to master node
>>>>    and from there depending upon meta data information it reads and decides
>>>>    that what portion of data should be going to which node?
>>>>    4. Also if i have 3 node cluster with 1 master and 2 slaves and if
>>>>    my data is equally distributed in two nodes and if i have replication set
>>>>    to 2 then where and how will replication take place as i do not have any
>>>>    node vacant for doing replication?
>>>>    5. Also, for POC, does it make sense to go with Cloudera 3 node
>>>>    free cluster or Hortonworks 3 node free cluster or it makes sense to go
>>>>    with opensource hadoop version and if we go with open source hadoop version
>>>>    then where can we define that which is master node and which is slave node
>>>>    and also can we have all 3 nodes on same machine or we need to have all 3
>>>>    nodes on different machines?
>>>>    6. Also, what are the pros and cons with going through
>>>>    Hortonworks/Cloudera as opposed to Apache Hadoop from initial POC point of
>>>>    view?
>>>>    7. Also, if we go with Hortonworks/Cloudera then what all tools are
>>>>    come clubbed together with Hadoop framework and if we go with Apache
>>>>    Hadoop, do we get any tools like Pig, Hive clubbed together or we have to
>>>>    install them separately?
>>>>
>>>> Since am staring off on Hadoop Journey recently, I would really
>>>> appreciate if community can point me in right direction?
>>>>
>>>> Regards, Andy.
>>>>
>>>
>>>
>>
>

Re: Use Cases for Structured Data

Posted by "ados1984@gmail.com" <ad...@gmail.com>.
Thanks D, that certainly answers my question.

I was just taking quick look at Hortonworks HDP vs Hortonworks Sandbox, do
you know of any benefits of using Sandbox as opposed to Hortonworks Data
Platforms?


On Wed, Mar 12, 2014 at 4:02 PM, Dieter De Witte <dr...@gmail.com> wrote:

> Hi,
>
> 1) HDFS is just a file system, it hides the fact that it is distributed.
> 2) Mapreduce is the most lowlevel analytics tool I think, you can just
> specify an input and in your map and reduce function define some
> functionality to deal with this input. No need for HBase,... although they
> can be extremely useful..
> 3) this is all in the hadoop reference: first the namenode finds a place
> to allocate your data, then it gets copied to the corresponding datanode 1,
> and from datanode 1 it is copied to datanode 2 (note the numbers have no
> special meaning)
> 4) Your data will be on both datanodes. Why would that be a problem?
> 5) For a proof of concept I would use a ready-made virtual machine from
> one of the three big vendors: cloudera, mapR or hortonworks
> 6) Apache version is more basic, the commercial distributions have more
> built-in features, are easier to work with I guess
> 7) You have to install them seperately, the main reason to go for one of
> the vendors maybe?
>
> You should defintely have a look at the reference, you don't have to read
> it from A-Z but it contains sections where every single sentence will
> answer one of your questions..
>
> Regards, D
>
>
>
> 2014-03-12 20:37 GMT+01:00 ados1984@gmail.com <ad...@gmail.com>:
>
> Thank you Shahab but it would be really nice if I can get some input on my
>> initial question as it would really help.
>>
>>
>> On Wed, Mar 12, 2014 at 3:11 PM, Shahab Yunus <sh...@gmail.com>wrote:
>>
>>> I would suggest that given the level of details that you are looking for
>>> and fundamental nature of your questions, you should get hold of books or
>>> online documentation. Basically some reading/research.
>>>
>>> Latest edition of
>>> http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449311520is highly recommended to begin with.
>>>
>>> Regards,
>>> Shahab
>>>
>>>
>>> On Wed, Mar 12, 2014 at 3:07 PM, ados1984@gmail.com <ad...@gmail.com>wrote:
>>>
>>>> Hello Team,
>>>>
>>>> I am starting off on Hadoop eco-system and wanted to learn first based
>>>> on my use case if Hadoop is right tool for me.
>>>>
>>>> I have only structured data and my goal is to safe this data into
>>>> Hadoop and take benefit of replication factor. I am using Microsoft tools
>>>> for doing analysis and it provides me with good drag and drop functionality
>>>> for creating different kind of anaylsis and also it has hadoop drivers so
>>>> it can have hadoop as data source for doing analysis.
>>>>
>>>> My question here is how benefits YARN architecture give me in tems of
>>>> analysis that my Microsoft, Netezza of Tableau products are not giving me.
>>>> I am just trying to understand value of introducing Hadoop in my
>>>> Architecture in terms of Analysis apart from data replication. Any insights
>>>> would be very helpful.
>>>>
>>>> Also, my goal for POC is related to efficient data storage/retrieval
>>>> and so
>>>>
>>>>    1. how does data retrieval work in hadoop?
>>>>    2. do i always need to have any kind of data source on top of hdfs
>>>>    like hbase/cassandra/mongo or there is not need for one and i can have all
>>>>    my data stored in hdfs directly and can retrieve them when i need by using
>>>>    different analytic tools that have hdfs as data source?
>>>>    3. say if i have 3 node cluster, one master and 2 slaves and if am
>>>>    trying to insert data into hadoop then what is the cycle that framework
>>>>    performs to install my data into hdfs - does my process reads all meta data
>>>>    information from master node about where is my slaves nodes and what kind
>>>>    of data should go on which slave node or all data is send to master node
>>>>    and from there depending upon meta data information it reads and decides
>>>>    that what portion of data should be going to which node?
>>>>    4. Also if i have 3 node cluster with 1 master and 2 slaves and if
>>>>    my data is equally distributed in two nodes and if i have replication set
>>>>    to 2 then where and how will replication take place as i do not have any
>>>>    node vacant for doing replication?
>>>>    5. Also, for POC, does it make sense to go with Cloudera 3 node
>>>>    free cluster or Hortonworks 3 node free cluster or it makes sense to go
>>>>    with opensource hadoop version and if we go with open source hadoop version
>>>>    then where can we define that which is master node and which is slave node
>>>>    and also can we have all 3 nodes on same machine or we need to have all 3
>>>>    nodes on different machines?
>>>>    6. Also, what are the pros and cons with going through
>>>>    Hortonworks/Cloudera as opposed to Apache Hadoop from initial POC point of
>>>>    view?
>>>>    7. Also, if we go with Hortonworks/Cloudera then what all tools are
>>>>    come clubbed together with Hadoop framework and if we go with Apache
>>>>    Hadoop, do we get any tools like Pig, Hive clubbed together or we have to
>>>>    install them separately?
>>>>
>>>> Since am staring off on Hadoop Journey recently, I would really
>>>> appreciate if community can point me in right direction?
>>>>
>>>> Regards, Andy.
>>>>
>>>
>>>
>>
>

Re: Use Cases for Structured Data

Posted by "ados1984@gmail.com" <ad...@gmail.com>.
Hey D,
Regarding your point 5: "For a proof of concept I would use a ready-made
virtual machine from one to 3 big vendors - cloudera, mapR and hortonworks"

I want to understand how this virtual setup would work and how much master
and slaves nodes I can have in this virtual setup and in general what are
differences between the actual Hadoop Distribution to this virtual ready
made setups?

Regards, Andy.



On Wed, Mar 12, 2014 at 4:02 PM, Dieter De Witte <dr...@gmail.com> wrote:

> Hi,
>
> 1) HDFS is just a file system, it hides the fact that it is distributed.
> 2) Mapreduce is the most lowlevel analytics tool I think, you can just
> specify an input and in your map and reduce function define some
> functionality to deal with this input. No need for HBase,... although they
> can be extremely useful..
> 3) this is all in the hadoop reference: first the namenode finds a place
> to allocate your data, then it gets copied to the corresponding datanode 1,
> and from datanode 1 it is copied to datanode 2 (note the numbers have no
> special meaning)
> 4) Your data will be on both datanodes. Why would that be a problem?
> 5) For a proof of concept I would use a ready-made virtual machine from
> one of the three big vendors: cloudera, mapR or hortonworks
> 6) Apache version is more basic, the commercial distributions have more
> built-in features, are easier to work with I guess
> 7) You have to install them seperately, the main reason to go for one of
> the vendors maybe?
>
> You should defintely have a look at the reference, you don't have to read
> it from A-Z but it contains sections where every single sentence will
> answer one of your questions..
>
> Regards, D
>
>
>
> 2014-03-12 20:37 GMT+01:00 ados1984@gmail.com <ad...@gmail.com>:
>
> Thank you Shahab but it would be really nice if I can get some input on my
>> initial question as it would really help.
>>
>>
>> On Wed, Mar 12, 2014 at 3:11 PM, Shahab Yunus <sh...@gmail.com>wrote:
>>
>>> I would suggest that given the level of details that you are looking for
>>> and fundamental nature of your questions, you should get hold of books or
>>> online documentation. Basically some reading/research.
>>>
>>> Latest edition of
>>> http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449311520is highly recommended to begin with.
>>>
>>> Regards,
>>> Shahab
>>>
>>>
>>> On Wed, Mar 12, 2014 at 3:07 PM, ados1984@gmail.com <ad...@gmail.com>wrote:
>>>
>>>> Hello Team,
>>>>
>>>> I am starting off on Hadoop eco-system and wanted to learn first based
>>>> on my use case if Hadoop is right tool for me.
>>>>
>>>> I have only structured data and my goal is to safe this data into
>>>> Hadoop and take benefit of replication factor. I am using Microsoft tools
>>>> for doing analysis and it provides me with good drag and drop functionality
>>>> for creating different kind of anaylsis and also it has hadoop drivers so
>>>> it can have hadoop as data source for doing analysis.
>>>>
>>>> My question here is how benefits YARN architecture give me in tems of
>>>> analysis that my Microsoft, Netezza of Tableau products are not giving me.
>>>> I am just trying to understand value of introducing Hadoop in my
>>>> Architecture in terms of Analysis apart from data replication. Any insights
>>>> would be very helpful.
>>>>
>>>> Also, my goal for POC is related to efficient data storage/retrieval
>>>> and so
>>>>
>>>>    1. how does data retrieval work in hadoop?
>>>>    2. do i always need to have any kind of data source on top of hdfs
>>>>    like hbase/cassandra/mongo or there is not need for one and i can have all
>>>>    my data stored in hdfs directly and can retrieve them when i need by using
>>>>    different analytic tools that have hdfs as data source?
>>>>    3. say if i have 3 node cluster, one master and 2 slaves and if am
>>>>    trying to insert data into hadoop then what is the cycle that framework
>>>>    performs to install my data into hdfs - does my process reads all meta data
>>>>    information from master node about where is my slaves nodes and what kind
>>>>    of data should go on which slave node or all data is send to master node
>>>>    and from there depending upon meta data information it reads and decides
>>>>    that what portion of data should be going to which node?
>>>>    4. Also if i have 3 node cluster with 1 master and 2 slaves and if
>>>>    my data is equally distributed in two nodes and if i have replication set
>>>>    to 2 then where and how will replication take place as i do not have any
>>>>    node vacant for doing replication?
>>>>    5. Also, for POC, does it make sense to go with Cloudera 3 node
>>>>    free cluster or Hortonworks 3 node free cluster or it makes sense to go
>>>>    with opensource hadoop version and if we go with open source hadoop version
>>>>    then where can we define that which is master node and which is slave node
>>>>    and also can we have all 3 nodes on same machine or we need to have all 3
>>>>    nodes on different machines?
>>>>    6. Also, what are the pros and cons with going through
>>>>    Hortonworks/Cloudera as opposed to Apache Hadoop from initial POC point of
>>>>    view?
>>>>    7. Also, if we go with Hortonworks/Cloudera then what all tools are
>>>>    come clubbed together with Hadoop framework and if we go with Apache
>>>>    Hadoop, do we get any tools like Pig, Hive clubbed together or we have to
>>>>    install them separately?
>>>>
>>>> Since am staring off on Hadoop Journey recently, I would really
>>>> appreciate if community can point me in right direction?
>>>>
>>>> Regards, Andy.
>>>>
>>>
>>>
>>
>

Re: Use Cases for Structured Data

Posted by "ados1984@gmail.com" <ad...@gmail.com>.
Hey D,
Regarding your point 5: "For a proof of concept I would use a ready-made
virtual machine from one to 3 big vendors - cloudera, mapR and hortonworks"

I want to understand how this virtual setup would work and how much master
and slaves nodes I can have in this virtual setup and in general what are
differences between the actual Hadoop Distribution to this virtual ready
made setups?

Regards, Andy.



On Wed, Mar 12, 2014 at 4:02 PM, Dieter De Witte <dr...@gmail.com> wrote:

> Hi,
>
> 1) HDFS is just a file system, it hides the fact that it is distributed.
> 2) Mapreduce is the most lowlevel analytics tool I think, you can just
> specify an input and in your map and reduce function define some
> functionality to deal with this input. No need for HBase,... although they
> can be extremely useful..
> 3) this is all in the hadoop reference: first the namenode finds a place
> to allocate your data, then it gets copied to the corresponding datanode 1,
> and from datanode 1 it is copied to datanode 2 (note the numbers have no
> special meaning)
> 4) Your data will be on both datanodes. Why would that be a problem?
> 5) For a proof of concept I would use a ready-made virtual machine from
> one of the three big vendors: cloudera, mapR or hortonworks
> 6) Apache version is more basic, the commercial distributions have more
> built-in features, are easier to work with I guess
> 7) You have to install them seperately, the main reason to go for one of
> the vendors maybe?
>
> You should defintely have a look at the reference, you don't have to read
> it from A-Z but it contains sections where every single sentence will
> answer one of your questions..
>
> Regards, D
>
>
>
> 2014-03-12 20:37 GMT+01:00 ados1984@gmail.com <ad...@gmail.com>:
>
> Thank you Shahab but it would be really nice if I can get some input on my
>> initial question as it would really help.
>>
>>
>> On Wed, Mar 12, 2014 at 3:11 PM, Shahab Yunus <sh...@gmail.com>wrote:
>>
>>> I would suggest that, given the level of detail you are looking for and
>>> the fundamental nature of your questions, you get hold of books or
>>> online documentation. Basically, some reading/research.
>>>
>>> The latest edition of
>>> http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449311520 is
>>> highly recommended to begin with.
>>>
>>> Regards,
>>> Shahab
>>>
>>>
>>> On Wed, Mar 12, 2014 at 3:07 PM, ados1984@gmail.com <ad...@gmail.com> wrote:
>>>
>>>> Hello Team,
>>>>
>>>> I am starting off on the Hadoop ecosystem and wanted to learn first,
>>>> based on my use case, whether Hadoop is the right tool for me.
>>>>
>>>> I have only structured data, and my goal is to save this data into
>>>> Hadoop and take advantage of the replication factor. I am using
>>>> Microsoft tools for analysis; they provide good drag-and-drop
>>>> functionality for creating different kinds of analyses, and they have
>>>> Hadoop drivers, so Hadoop can serve as a data source for the analysis.
>>>>
>>>> My question here is what benefits the YARN architecture gives me in
>>>> terms of analysis that my Microsoft, Netezza, or Tableau products are
>>>> not giving me. I am just trying to understand the value of introducing
>>>> Hadoop into my architecture in terms of analysis, apart from data
>>>> replication. Any insights would be very helpful.
>>>>
>>>> Also, my goal for the POC is efficient data storage/retrieval, so:
>>>>
>>>>    1. How does data retrieval work in Hadoop?
>>>>    2. Do I always need some kind of data store on top of HDFS, like
>>>>    HBase/Cassandra/Mongo, or is there no need for one, so that I can
>>>>    store all my data directly in HDFS and retrieve it when needed using
>>>>    different analytic tools that support HDFS as a data source?
>>>>    3. Say I have a 3-node cluster, one master and 2 slaves, and I am
>>>>    inserting data into Hadoop. What is the cycle the framework performs
>>>>    to write my data into HDFS? Does my process read all the metadata
>>>>    from the master node about where the slave nodes are and what data
>>>>    should go on which slave node, or is all data sent to the master
>>>>    node, which then reads the metadata and decides what portion of the
>>>>    data should go to which node?
>>>>    4. Also, if I have a 3-node cluster with 1 master and 2 slaves, my
>>>>    data is equally distributed across the two nodes, and replication is
>>>>    set to 2, then where and how will replication take place, since I do
>>>>    not have any node vacant for replication?
>>>>    5. Also, for the POC, does it make sense to go with a Cloudera
>>>>    3-node free cluster or a Hortonworks 3-node free cluster, or with
>>>>    the open-source Hadoop version? If we go with the open-source
>>>>    version, where do we define which node is the master and which are
>>>>    the slaves, and can we have all 3 nodes on the same machine, or do
>>>>    we need all 3 nodes on different machines?
>>>>    6. Also, what are the pros and cons of going with
>>>>    Hortonworks/Cloudera as opposed to Apache Hadoop from an initial POC
>>>>    point of view?
>>>>    7. Also, if we go with Hortonworks/Cloudera, what tools come bundled
>>>>    with the Hadoop framework? And if we go with Apache Hadoop, do we
>>>>    get tools like Pig and Hive bundled, or do we have to install them
>>>>    separately?
>>>>
>>>> Since I am just starting off on my Hadoop journey, I would really
>>>> appreciate it if the community could point me in the right direction.
>>>>
>>>> Regards, Andy.
>>>>
>>>
>>>
>>
>
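Dieter's point 3 — the namenode choosing block locations and the datanodes pipelining replicas — can be sketched as a toy model in plain Python. This is a simplified illustration, not the real Hadoop client or RPC protocol; the class and method names here are invented for the sketch:

```python
class NameNode:
    """Toy namenode: tracks block locations and picks targets for new blocks."""
    def __init__(self, datanodes, replication=2):
        self.datanodes = datanodes
        self.replication = replication
        self.block_map = {}  # block id -> list of datanode names

    def allocate(self, block_id):
        # Real HDFS uses rack-aware placement; here we simply pick the
        # first `replication` datanodes.
        targets = self.datanodes[: self.replication]
        self.block_map[block_id] = [dn.name for dn in targets]
        return targets

class DataNode:
    """Toy datanode: stores a block, then forwards it down the pipeline."""
    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def write(self, block_id, data, pipeline):
        self.blocks[block_id] = data
        if pipeline:  # forward the replica to the next datanode
            pipeline[0].write(block_id, data, pipeline[1:])

def put(namenode, block_id, data):
    """Client-side write: ask the namenode for targets, write to the first
    datanode, and let the datanodes pipeline the replicas onward."""
    targets = namenode.allocate(block_id)
    targets[0].write(block_id, data, targets[1:])

dns = [DataNode("dn1"), DataNode("dn2")]
nn = NameNode(dns, replication=2)
put(nn, "blk_0001", b"structured data")
print(sorted(nn.block_map["blk_0001"]))  # ['dn1', 'dn2']
```

With 2 datanodes and replication 2, every block simply ends up on both nodes, which is also the answer to question 4: no vacant third node is needed.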


Re: Use Cases for Structured Data

Posted by Shahab Yunus <sh...@gmail.com>.
Assuming that the following is your initial question:
*"My question here is what benefits the YARN architecture gives me in terms
of analysis that my Microsoft, Netezza, or Tableau products are not giving
me. I am just trying to understand the value of introducing Hadoop into my
architecture in terms of analysis, apart from data replication."*

YARN is not 'data storage' (hence replication does not matter here), nor is
it only 'analytics'. It is a framework that allows you to write or
implement applications based on a distributed architecture. In the pre-YARN
world that meant Map/Reduce, but with YARN you can work not only with
Map/Reduce-based applications but with any other distributed paradigm. YARN
takes care of a lot of the boilerplate, plus the advanced plumbing and
infrastructure concerns usually encountered with parallel distributed
applications.

I have not worked with Netezza, but from what I understand, YARN and the
distributed applications built on it are not just for analytics. They can
be any applications that want to leverage a highly distributed and parallel
architecture and design. So basically YARN gives you an environment where
you can implement applications using a custom (or the popular Map/Reduce)
paradigm to build parallel and distributed applications.

As YARN is open source, you have much more control over it. You can get
the code and modify it to your own needs, which I presume is not available
or possible with Netezza.

I didn't get which application you meant when you said 'Microsoft'.

Other experts can chime in.

Regards,
Shahab
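The Map/Reduce paradigm discussed above boils down to two user-supplied functions plus a framework-driven shuffle. A single-process sketch in plain Python (no Hadoop involved; the function names are illustrative) shows the shape of it:

```python
from collections import defaultdict

def map_fn(line):
    # User-defined map: emit (key, value) pairs for one input record.
    for word in line.split():
        yield (word.lower(), 1)

def reduce_fn(key, values):
    # User-defined reduce: combine all values that share a key.
    return (key, sum(values))

def run_job(lines):
    # The framework's part: run maps, shuffle pairs by key, run reduces.
    shuffled = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            shuffled[key].append(value)
    return dict(reduce_fn(k, vs) for k, vs in sorted(shuffled.items()))

counts = run_job(["structured data in hadoop", "structured data"])
print(counts)  # {'data': 2, 'hadoop': 1, 'in': 1, 'structured': 2}
```

In real Hadoop the framework distributes the map and reduce calls across the cluster and performs the shuffle over the network, but the user code keeps this same two-function shape.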




Re: Use Cases for Structured Data

Posted by Dieter De Witte <dr...@gmail.com>.
Hi,

1) HDFS is just a file system, it hides the fact that it is distributed.
2) Mapreduce is the most lowlevel analytics tool I think, you can just
specify an input and in your map and reduce function define some
functionality to deal with this input. No need for HBase,... although they
can be extremely useful..
3) this is all in the hadoop reference: first the namenode finds a place to
allocate your data, then it gets copied to the corresponding datanode 1,
and from datanode 1 it is copied to datanode 2 (note the numbers have no
special meaning)
4) Your data will be on both datanodes. Why would that be a problem?
5) For a proof of concept I would use a ready-made virtual machine from one
of the three big vendors: cloudera, mapR or hortonworks
6) Apache version is more basic, the commercial distributions have more
built-in features, are easier to work with I guess
7) You have to install them seperately, the main reason to go for one of
the vendors maybe?

You should defintely have a look at the reference, you don't have to read
it from A-Z but it contains sections where every single sentence will
answer one of your questions..

Regards, D



2014-03-12 20:37 GMT+01:00 ados1984@gmail.com <ad...@gmail.com>:

> Thank you Shahab but it would be really nice if I can get some input on my
> initial question as it would really help.
>
>
> On Wed, Mar 12, 2014 at 3:11 PM, Shahab Yunus <sh...@gmail.com>wrote:
>
>> I would suggest that, given the level of detail you are looking for and
>> the fundamental nature of your questions, you get hold of some books or
>> online documentation; basically, do some reading/research.
>>
>> Latest edition of
>> http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449311520 is
>> highly recommended to begin with.
>>
>> Regards,
>> Shahab
>>
>>
>> On Wed, Mar 12, 2014 at 3:07 PM, ados1984@gmail.com <ad...@gmail.com>wrote:
>>
>>> Hello Team,
>>>
>>> I am starting off with the Hadoop ecosystem and first wanted to learn,
>>> based on my use case, whether Hadoop is the right tool for me.
>>>
>>> I have only structured data, and my goal is to save this data into
>>> Hadoop and take advantage of the replication factor. I am using
>>> Microsoft tools for analysis; they provide good drag-and-drop
>>> functionality for creating different kinds of analyses, and they also
>>> have Hadoop drivers, so they can use Hadoop as a data source for
>>> analysis.
>>>
>>> My question here is what benefits the YARN architecture gives me in
>>> terms of analysis that my Microsoft, Netezza, or Tableau products are
>>> not giving me. I am just trying to understand the value of introducing
>>> Hadoop into my architecture in terms of analysis, apart from data
>>> replication. Any insights would be very helpful.
>>>
>>> Also, my goal for the POC is related to efficient data
>>> storage/retrieval, and so:
>>>
>>>    1. How does data retrieval work in Hadoop?
>>>    2. Do I always need some kind of data store on top of HDFS, like
>>>    HBase/Cassandra/Mongo, or is there no need for one, so that I can
>>>    store all my data in HDFS directly and retrieve it when needed
>>>    using the different analytic tools that support HDFS as a data
>>>    source?
>>>    3. Say I have a 3-node cluster, one master and 2 slaves, and I am
>>>    inserting data into Hadoop: what cycle does the framework perform
>>>    to put my data into HDFS? Does my process read metadata from the
>>>    master node about where the slave nodes are and what data should go
>>>    on which slave node, or is all data sent to the master node, which
>>>    then decides, based on the metadata, what portion of the data
>>>    should go to which node?
>>>    4. Also, if I have a 3-node cluster with 1 master and 2 slaves, my
>>>    data is equally distributed across the two nodes, and replication
>>>    is set to 2, then where and how will replication take place, as I
>>>    do not have any vacant node for replication?
>>>    5. Also, for the POC, does it make sense to go with a Cloudera
>>>    3-node free cluster or a Hortonworks 3-node free cluster, or to go
>>>    with the open-source Hadoop version? If we go with the open-source
>>>    version, where do we define which is the master node and which are
>>>    the slave nodes, and can we have all 3 nodes on the same machine,
>>>    or do we need them on different machines?
>>>    6. Also, what are the pros and cons of going with
>>>    Hortonworks/Cloudera as opposed to Apache Hadoop, from an initial
>>>    POC point of view?
>>>    7. Also, if we go with Hortonworks/Cloudera, what tools come
>>>    bundled with the Hadoop framework, and if we go with Apache Hadoop,
>>>    do we get tools like Pig and Hive bundled, or do we have to install
>>>    them separately?
>>>
>>> Since I am starting off on my Hadoop journey, I would really
>>> appreciate it if the community could point me in the right direction.
>>>
>>> Regards, Andy.
>>>
>>
>>
>

Re: Use Cases for Structured Data

Posted by Shahab Yunus <sh...@gmail.com>.
Assuming that the following is your initial question:
*"My question here is what benefits the YARN architecture gives me in
terms of analysis that my Microsoft, Netezza, or Tableau products are not
giving me. I am just trying to understand the value of introducing Hadoop
into my architecture in terms of analysis, apart from data replication."*

YARN is not 'data storage' (hence replication does not matter here), nor
is it just 'analytics'. It is a framework that allows you to write or
implement applications based on a distributed architecture. In the
pre-YARN world that meant Map/Reduce, but with YARN you can work not only
with Map/Reduce-based applications but with any other distributed
paradigm. YARN takes care of a lot of the boilerplate plus the advanced
plumbing and infrastructure concerns usually encountered with parallel,
distributed applications.

I have not worked with Netezza, but from what I understand, YARN and the
distributed applications built on it are not just for analytics. They can
be any application that wants to leverage a highly distributed and
parallel architecture and design. So basically YARN gives you an
environment where you can implement applications using a custom (or the
popular Map/Reduce) paradigm to build parallel and distributed
applications.
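The Map/Reduce paradigm mentioned above can be illustrated without any
cluster at all. The sketch below is plain Python, not Hadoop code; it
simply mimics the map, shuffle, and reduce phases on a word count:

```python
from collections import defaultdict

# Plain-Python mimic of the Map/Reduce paradigm: map emits (key, value)
# pairs, the "shuffle" groups values by key, and reduce folds each group
# into a result. On a real cluster, YARN would schedule many copies of
# map_fn and reduce_fn across the nodes in parallel.

def map_fn(line):
    # Emit (word, 1) for every word in the input line.
    return [(word, 1) for word in line.split()]

def reduce_fn(word, counts):
    # Fold all counts for one word into a total.
    return word, sum(counts)

def run_job(lines):
    grouped = defaultdict(list)          # shuffle phase: group by key
    for line in lines:
        for key, value in map_fn(line):
            grouped[key].append(value)
    return dict(reduce_fn(k, v) for k, v in grouped.items())

counts = run_job(["big data tools", "big cluster"])
# counts == {"big": 2, "data": 1, "tools": 1, "cluster": 1}
```

The point of the paradigm is that map_fn and reduce_fn contain all the
application logic, while the framework owns the distribution, grouping,
and fault tolerance.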

As YARN is open source, you have much more control over it. You can get
the code and modify it to your own needs, which I presume is not
available or possible with Netezza.

I didn't get which application you meant when you said 'Microsoft'.

Other experts can chime in.

Regards,
Shahab


On Wed, Mar 12, 2014 at 3:37 PM, ados1984@gmail.com <ad...@gmail.com>wrote:

> Thank you, Shahab, but it would be really nice if I could get some input
> on my initial question, as it would really help.
>
>
> On Wed, Mar 12, 2014 at 3:11 PM, Shahab Yunus <sh...@gmail.com>wrote:
>
>> I would suggest that, given the level of detail you are looking for and
>> the fundamental nature of your questions, you get hold of some books or
>> online documentation; basically, do some reading/research.
>>
>> Latest edition of
>> http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449311520 is
>> highly recommended to begin with.
>>
>> Regards,
>> Shahab
>>
>>
>> On Wed, Mar 12, 2014 at 3:07 PM, ados1984@gmail.com <ad...@gmail.com>wrote:
>>
>>> Hello Team,
>>>
>>> I am starting off with the Hadoop ecosystem and first wanted to learn,
>>> based on my use case, whether Hadoop is the right tool for me.
>>>
>>> I have only structured data, and my goal is to save this data into
>>> Hadoop and take advantage of the replication factor. I am using
>>> Microsoft tools for analysis; they provide good drag-and-drop
>>> functionality for creating different kinds of analyses, and they also
>>> have Hadoop drivers, so they can use Hadoop as a data source for
>>> analysis.
>>>
>>> My question here is what benefits the YARN architecture gives me in
>>> terms of analysis that my Microsoft, Netezza, or Tableau products are
>>> not giving me. I am just trying to understand the value of introducing
>>> Hadoop into my architecture in terms of analysis, apart from data
>>> replication. Any insights would be very helpful.
>>>
>>> Also, my goal for the POC is related to efficient data
>>> storage/retrieval, and so:
>>>
>>>    1. How does data retrieval work in Hadoop?
>>>    2. Do I always need some kind of data store on top of HDFS, like
>>>    HBase/Cassandra/Mongo, or is there no need for one, so that I can
>>>    store all my data in HDFS directly and retrieve it when needed
>>>    using the different analytic tools that support HDFS as a data
>>>    source?
>>>    3. Say I have a 3-node cluster, one master and 2 slaves, and I am
>>>    inserting data into Hadoop: what cycle does the framework perform
>>>    to put my data into HDFS? Does my process read metadata from the
>>>    master node about where the slave nodes are and what data should go
>>>    on which slave node, or is all data sent to the master node, which
>>>    then decides, based on the metadata, what portion of the data
>>>    should go to which node?
>>>    4. Also, if I have a 3-node cluster with 1 master and 2 slaves, my
>>>    data is equally distributed across the two nodes, and replication
>>>    is set to 2, then where and how will replication take place, as I
>>>    do not have any vacant node for replication?
>>>    5. Also, for the POC, does it make sense to go with a Cloudera
>>>    3-node free cluster or a Hortonworks 3-node free cluster, or to go
>>>    with the open-source Hadoop version? If we go with the open-source
>>>    version, where do we define which is the master node and which are
>>>    the slave nodes, and can we have all 3 nodes on the same machine,
>>>    or do we need them on different machines?
>>>    6. Also, what are the pros and cons of going with
>>>    Hortonworks/Cloudera as opposed to Apache Hadoop, from an initial
>>>    POC point of view?
>>>    7. Also, if we go with Hortonworks/Cloudera, what tools come
>>>    bundled with the Hadoop framework, and if we go with Apache Hadoop,
>>>    do we get tools like Pig and Hive bundled, or do we have to install
>>>    them separately?
>>>
>>> Since I am starting off on my Hadoop journey, I would really
>>> appreciate it if the community could point me in the right direction.
>>>
>>> Regards, Andy.
>>>
>>
>>
>

Re: Use Cases for Structured Data

Posted by "ados1984@gmail.com" <ad...@gmail.com>.
Thank you, Shahab, but it would be really nice if I could get some input
on my initial question, as it would really help.


On Wed, Mar 12, 2014 at 3:11 PM, Shahab Yunus <sh...@gmail.com>wrote:

> I would suggest that, given the level of detail you are looking for and
> the fundamental nature of your questions, you get hold of some books or
> online documentation; basically, do some reading/research.
>
> Latest edition of
> http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449311520 is
> highly recommended to begin with.
>
> Regards,
> Shahab
>
>
> On Wed, Mar 12, 2014 at 3:07 PM, ados1984@gmail.com <ad...@gmail.com>wrote:
>
>> Hello Team,
>>
>> I am starting off with the Hadoop ecosystem and first wanted to learn,
>> based on my use case, whether Hadoop is the right tool for me.
>>
>> I have only structured data, and my goal is to save this data into
>> Hadoop and take advantage of the replication factor. I am using
>> Microsoft tools for analysis; they provide good drag-and-drop
>> functionality for creating different kinds of analyses, and they also
>> have Hadoop drivers, so they can use Hadoop as a data source for
>> analysis.
>>
>> My question here is what benefits the YARN architecture gives me in
>> terms of analysis that my Microsoft, Netezza, or Tableau products are
>> not giving me. I am just trying to understand the value of introducing
>> Hadoop into my architecture in terms of analysis, apart from data
>> replication. Any insights would be very helpful.
>>
>> Also, my goal for the POC is related to efficient data
>> storage/retrieval, and so:
>>
>>    1. How does data retrieval work in Hadoop?
>>    2. Do I always need some kind of data store on top of HDFS, like
>>    HBase/Cassandra/Mongo, or is there no need for one, so that I can
>>    store all my data in HDFS directly and retrieve it when needed
>>    using the different analytic tools that support HDFS as a data
>>    source?
>>    3. Say I have a 3-node cluster, one master and 2 slaves, and I am
>>    inserting data into Hadoop: what cycle does the framework perform
>>    to put my data into HDFS? Does my process read metadata from the
>>    master node about where the slave nodes are and what data should go
>>    on which slave node, or is all data sent to the master node, which
>>    then decides, based on the metadata, what portion of the data
>>    should go to which node?
>>    4. Also, if I have a 3-node cluster with 1 master and 2 slaves, my
>>    data is equally distributed across the two nodes, and replication
>>    is set to 2, then where and how will replication take place, as I
>>    do not have any vacant node for replication?
>>    5. Also, for the POC, does it make sense to go with a Cloudera
>>    3-node free cluster or a Hortonworks 3-node free cluster, or to go
>>    with the open-source Hadoop version? If we go with the open-source
>>    version, where do we define which is the master node and which are
>>    the slave nodes, and can we have all 3 nodes on the same machine,
>>    or do we need them on different machines?
>>    6. Also, what are the pros and cons of going with
>>    Hortonworks/Cloudera as opposed to Apache Hadoop, from an initial
>>    POC point of view?
>>    7. Also, if we go with Hortonworks/Cloudera, what tools come
>>    bundled with the Hadoop framework, and if we go with Apache Hadoop,
>>    do we get tools like Pig and Hive bundled, or do we have to install
>>    them separately?
>>
>> Since I am starting off on my Hadoop journey, I would really
>> appreciate it if the community could point me in the right direction.
>>
>> Regards, Andy.
>>
>
>

Re: Use Cases for Structured Data

Posted by Shahab Yunus <sh...@gmail.com>.
I would suggest that, given the level of detail you are looking for and
the fundamental nature of your questions, you get hold of some books or
online documentation; basically, do some reading/research.

Latest edition of
http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449311520 is
highly recommended to begin with.

Regards,
Shahab


On Wed, Mar 12, 2014 at 3:07 PM, ados1984@gmail.com <ad...@gmail.com>wrote:

> Hello Team,
>
> I am starting off on Hadoop eco-system and wanted to learn first based on
> my use case if Hadoop is right tool for me.
>
> I have only structured data and my goal is to safe this data into Hadoop
> and take benefit of replication factor. I am using Microsoft tools for
> doing analysis and it provides me with good drag and drop functionality for
> creating different kind of anaylsis and also it has hadoop drivers so it
> can have hadoop as data source for doing analysis.
>
> My question here is how benefits YARN architecture give me in tems of
> analysis that my Microsoft, Netezza of Tableau products are not giving me.
> I am just trying to understand value of introducing Hadoop in my
> Architecture in terms of Analysis apart from data replication. Any insights
> would be very helpful.
>
> Also, my goal for POC is related to efficient data storage/retrieval and
> so
>
>    1. how does data retrieval work in hadoop?
>    2. do i always need to have any kind of data source on top of hdfs
>    like hbase/cassandra/mongo or there is not need for one and i can have all
>    my data stored in hdfs directly and can retrieve them when i need by using
>    different analytic tools that have hdfs as data source?
>    3. say if i have 3 node cluster, one master and 2 slaves and if am
>    trying to insert data into hadoop then what is the cycle that framework
>    performs to install my data into hdfs - does my process reads all meta data
>    information from master node about where is my slaves nodes and what kind
>    of data should go on which slave node or all data is send to master node
>    and from there depending upon meta data information it reads and decides
>    that what portion of data should be going to which node?
>    4. Also if i have 3 node cluster with 1 master and 2 slaves and if my
>    data is equally distributed in two nodes and if i have replication set to 2
>    then where and how will replication take place as i do not have any node
>    vacant for doing replication?
>    5. Also, for the POC, does it make sense to go with a Cloudera 3-node
>    free cluster or a Hortonworks 3-node free cluster, or with the
>    open-source Hadoop version? If we go with open-source Hadoop, where do
>    we define which node is the master and which are the slaves? And can
>    we have all 3 nodes on the same machine, or do we need all 3 nodes
>    on different machines?
>    6. Also, what are the pros and cons of going with
>    Hortonworks/Cloudera as opposed to Apache Hadoop, from an initial POC
>    point of view?
>    7. Also, if we go with Hortonworks/Cloudera, which tools come bundled
>    with the Hadoop framework? And if we go with Apache Hadoop, do tools
>    like Pig and Hive come bundled, or do we have to install them
>    separately?
>
> Since I am just starting off on my Hadoop journey, I would really
> appreciate it if the community could point me in the right direction.
>
> Regards, Andy.
>
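[Editor's note on questions 3 and 4 above: in HDFS the client contacts the NameNode (master) only for metadata; the NameNode tells the client which DataNodes (slaves) should hold each block, and the client streams the data directly to the first DataNode, which forwards it down the pipeline. The data never passes through the master. With replication = 2 and exactly 2 DataNodes, each block is simply stored on both slaves; no third "vacant" node is needed. The Python below is an illustrative simulation only; the function names are hypothetical, not real Hadoop APIs.]

```python
# Hypothetical simulation of HDFS block placement. Names like
# choose_pipeline/write_file are illustrative, not Hadoop APIs.
def choose_pipeline(datanodes, replication):
    # The NameNode picks `replication` distinct DataNodes per block.
    # Real HDFS placement is rack-aware; this sketch takes the first N.
    if replication > len(datanodes):
        raise ValueError("replication factor exceeds available DataNodes")
    return list(datanodes[:replication])

def write_file(blocks, datanodes, replication):
    # The client asks the NameNode where each block should go, then
    # streams each block to the first DataNode in its pipeline, which
    # forwards it to the next replica.
    return {block: choose_pipeline(datanodes, replication)
            for block in blocks}

# With 2 DataNodes and replication=2, every block lives on BOTH slaves:
plan = write_file(["blk_1", "blk_2"], ["slave1", "slave2"], replication=2)
```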

Re: Use Cases for Structured Data

Posted by Shahab Yunus <sh...@gmail.com>.
I would suggest that, given the level of detail you are looking for and
the fundamental nature of your questions, you get hold of some books or
the online documentation and do some reading and research of your own.

The latest edition of Hadoop: The Definitive Guide
(http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449311520) is
highly recommended to begin with.

Regards,
Shahab


On Wed, Mar 12, 2014 at 3:07 PM, ados1984@gmail.com <ad...@gmail.com> wrote:

>
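[Editor's note on question 5 above: in plain Apache Hadoop (2.x era), master and worker roles are defined purely by configuration; there is no separate "master" install. A minimal sketch follows; the hostname is a placeholder.]

```xml
<!-- etc/hadoop/core-site.xml on every node: points all daemons and
     clients at the master (NameNode). "master.example.com" is a
     placeholder hostname. -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master.example.com:9000</value>
  </property>
</configuration>
```

The etc/hadoop/slaves file on the master then lists the worker hostnames, one per line, and start-dfs.sh starts a DataNode on each. For a POC, all daemons can also run on a single machine in pseudo-distributed mode, so separate physical machines are not required.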
