Posted to user@hadoop.apache.org by samir das mohapatra <sa...@gmail.com> on 2013/03/05 06:27:57 UTC

How to solve one scenario in Hadoop?

Hi All,
   I have one scenario where our organization is trying to implement
Hadoop.

Scenario Statement:

---------------------------------------

    Suppose we have various data sources, for example RDBMS, HDFS, and
streaming.


*Source Dataset Types:*

1. Single source
2. Joining sources
3. Filtered data set
4. Specific columns


We need to pull data from one source to another, for example from HDFS to
RDBMS or vice versa, based on a condition; that is, out of the whole source
data we may need only specific data, the whole data, or joined data at the
destination. So which direction should we go to pull the data for each of
the dataset types above?


Here is what I am thinking:

CASE-1   Data from HDFS to HDFS (different cluster), whole data
           :- we will use *distcp*

CASE-2   Data from HDFS to HDFS (different cluster), conditional data
(filtered data) :- we will use a *custom MapReduce program* that performs
the filter operation and then loads the result

CASE-3   Data from HDFS to RDBMS (whole data): *Sqoop*

CASE-4   Data from HDFS to RDBMS (conditional data): *Sqoop*

CASE-5   Some data from RDBMS and some data from HDFS, then filter and
load into HDFS: *JDBC with a MapReduce program*
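For CASE-2, one common way to write the "filter then load" MapReduce job is a
map-only Hadoop Streaming job, so the filter runs in the map phase and the
surviving records are written straight to the destination path. A minimal
sketch (the tab-separated layout and the amount threshold are hypothetical
examples, not part of the scenario above):

```python
#!/usr/bin/env python
# Minimal Hadoop Streaming mapper for CASE-2: keep only the records
# that pass a condition; everything else is dropped in the map phase.
# Assumes hypothetical tab-separated records with an amount in the
# third field -- adjust keep() to the real condition.
import sys


def keep(line, min_amount=100.0):
    """Return True if the record's amount field passes the filter."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        return False  # skip malformed records
    try:
        return float(fields[2]) >= min_amount
    except ValueError:
        return False


def main(stream=sys.stdin, out=sys.stdout):
    # Emit matching lines unchanged; a map-only job (zero reducers)
    # writes them directly to the output directory.
    for line in stream:
        if keep(line):
            out.write(line)


if __name__ == "__main__":
    main()
```

Submitted with something like `hadoop jar hadoop-streaming.jar -D
mapred.reduce.tasks=0 -mapper filter.py -input /src -output /dst` (the jar
name and paths here are placeholders for whatever your cluster uses).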


Note: Can anyone suggest whether I am wrong here, or whether there is
something other than this that would be easier to do?


Regards,

samir.

Re: How to solve one scenario in Hadoop?

Posted by Dino Kečo <di...@gmail.com>.
I would suggest Hive in these cases, because it makes it easy to manage
multiple data sources, it uses an SQL-like syntax, it scales because of
Hadoop, and it has joins implemented and optimized.
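For a join such as CASE-5 needs, Hive typically compiles the query down to a
reduce-side join: both inputs are keyed on the join column and merged per
key in the reduce phase. A minimal sketch of that idea in plain Python (the
function name and record shapes are illustrative, not from this thread):

```python
# Sketch of the reduce-side join Hive would generate when joining
# RDBMS-sourced rows with HDFS-sourced rows on a shared key.
# Records are (key, value) pairs; shapes here are hypothetical.
from collections import defaultdict


def reduce_side_join(rdbms_rows, hdfs_rows):
    """Inner-join two record sets on their first element (the key)."""
    by_key = defaultdict(lambda: ([], []))
    for key, value in rdbms_rows:
        by_key[key][0].append(value)
    for key, value in hdfs_rows:
        by_key[key][1].append(value)
    joined = []
    for key, (left, right) in by_key.items():
        # Emit the per-key cross product, as a join reducer would.
        for l in left:
            for r in right:
                joined.append((key, l, r))
    return joined
```

In Hive itself this is just `SELECT ... FROM a JOIN b ON a.key = b.key`,
with the distribution across the cluster handled for you.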

Regards
Dino
On Mar 6, 2013 8:46 PM, "Vikas Jadhav" <vi...@gmail.com> wrote:

> I will go with first case because if data size is large then it will
> distribute data across multiple nodes.


Re: How to solve one scenario in Hadoop?

Posted by Vikas Jadhav <vi...@gmail.com>.
I will go with the first case, because if the data size is large it will
distribute the data across multiple nodes.




-- 
Thanx and Regards
Vikas Jadhav
