Posted to user@hive.apache.org by Bhavesh Shah <bh...@gmail.com> on 2012/05/14 09:09:27 UTC

Is my Use Case possible with Hive?

Hello all,
My Use Case is:
1) I have a relational database (MS SQL Server) that holds a very large
amount of data.
2) I want to run analyses on this data and generate reports from the
results.
I have to generate various reports like this, each based on a different
analysis.

I tried to implement this using Hive. What I did is:
1) I imported all tables into Hive from MS SQL Server using Sqoop.
2) I wrote many Hive queries, which I execute over JDBC against the Hive
Thrift server.
3) I am getting the correct results, in table form, as expected.
4) But the problem is that execution takes far too long.
    (My complete program runs in about 3-4 hours on a *small amount of
data*.)

    I decided to do this using Hive, and as I said, Hive consumes a lot
of time for execution; my organization expects this task to complete in
about half an hour or less.

Given how much time the complete execution of this task takes, what
should I do?
I want to ask one thing:
*Is this use case possible with Hive?* If it is, what should I do in my
program to increase the performance?
*And if it is not, what is a better way to implement this use case?*

Please reply.
Thanks
Thanks


-- 
Regards,
Bhavesh Shah

Re: Is my Use Case possible with Hive?

Posted by Nitin Pawar <ni...@gmail.com>.
The problem with the Hive server over JDBC currently is that it does not
handle concurrent connections seamlessly and chokes on larger numbers of
parallel query executions.

For that one reason, I actually wrote a pipeline-style infrastructure
using shell scripts, which ran queries one after another, from different
terminals or as background processes. (This needed more memory on the
Hive client CLI, as the CLI often went OOM when too many queries were
doing pre-query processing, such as map-side joins.)
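Such a pipeline can be sketched as follows, here in Java to match the rest of the thread; the `hive` binary and the query file names are placeholders:

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Launch each query file as its own hive CLI process in the background,
// then wait for all of them -- a rough equivalent of the shell-script
// pipeline described above. Paths and file names are hypothetical.
public class HivePipeline {
    public static void main(String[] args) throws Exception {
        String[] queryFiles = {"query1.hql", "query2.hql", "query3.hql"};
        List<Process> running = new ArrayList<>();
        for (String qf : queryFiles) {
            if (!new File(qf).exists()) {
                System.out.println("missing " + qf + "; skipping");
                continue;
            }
            ProcessBuilder pb = new ProcessBuilder("hive", "-f", qf);
            pb.inheritIO();               // stream query output to our console
            running.add(pb.start());
        }
        for (Process p : running) {
            p.waitFor();                  // block until every background CLI exits
        }
        System.out.println("pipeline finished");
    }
}
```

Note the memory caveat above: each `hive` process is a full client, so running many at once multiplies the client-side memory use.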

On Tue, May 15, 2012 at 5:03 PM, Bhavesh Shah <bh...@gmail.com> wrote:

> Thanks all for your replies.
> Just now I tried one thing, as follows:
> 1) I opened two Hive CLIs.  hive>
> 2) I have one query which takes 7 jobs to execute. I submitted that
> query to both CLIs.
> 3) One Hive CLI took 147.319 seconds and the second took 161.542
> seconds.
> 4) Later I ran that query on only one CLI and it took 122.307 seconds.
>  What I want to ask is this: if multiple queries run in parallel, the
> total time is less than executing them one by one.
>
>   If I want to execute such parallel queries through JDBC, how can I do it?
>   I know that Hive can accept only one query at a time per connection.
> But is there still any way to do it?
>   Please suggest a solution for this.
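If the Thrift server can hold several sessions at once (one per connection), the parallel execution being asked about can be sketched by giving each worker thread its own JDBC connection. This is only a hypothetical sketch for the old HiveServer1 setup; the driver class, URL, and queries are placeholders:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// One Connection (i.e. one Hive session) per worker thread, since a
// single connection cannot multiplex queries. All names are placeholders.
public class ParallelHiveJdbc {
    public static void main(String[] args) throws Exception {
        try {
            Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
        } catch (ClassNotFoundException e) {
            System.out.println("Hive JDBC driver not on classpath; skipping");
            return;
        }
        String url = "jdbc:hive://localhost:10000/default";   // placeholder
        String[] queries = {
            "SELECT COUNT(*) FROM tmp",                       // placeholder
            "SELECT COUNT(*) FROM tmp_element_table"          // placeholder
        };
        ExecutorService pool = Executors.newFixedThreadPool(queries.length);
        for (String q : queries) {
            pool.submit(() -> {
                // each task opens its own connection/session
                try (Connection c = DriverManager.getConnection(url, "", "");
                     Statement s = c.createStatement()) {
                    s.execute(q);
                    System.out.println("done: " + q);
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}
```

Whether this actually helps depends on how well the server copes with concurrent sessions, which, as noted elsewhere in this thread, was a weak point of the Hive server at the time.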
>
>
> --
> Regards,
> Bhavesh Shah
>
>
> On Tue, May 15, 2012 at 1:15 AM, Nanda Vijaydev <na...@gmail.com> wrote:
>
>> Hadoop in general does well with fewer large data files rather than many
>> smaller ones. RDBMS-style indexing and run-time optimization are not
>> really available in Hadoop/Hive yet. So one suggestion is to combine some
>> of this data, if you can, into fewer tables while you run Sqoop. Even if
>> there is some redundancy it should be OK; storage is cheap and it helps
>> during reads.
>>
>> Other suggestions given in this thread are to set the map-side and
>> reduce-side Hive optimization parameters. Querying via JDBC is generally
>> slow as well. There are certain products in the Hadoop space that allow
>> Hive querying without a JDBC interface. Give one a try and it should
>> improve performance.
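The map-side and reduce-side optimization parameters mentioned above are set per session; a hypothetical sampler, with parameter names from the Hive 0.x era and purely illustrative values:

```sql
-- Illustrative session settings; tune the values for your cluster.
SET hive.exec.parallel=true;                 -- run independent job stages in parallel
SET hive.exec.parallel.thread.number=8;
SET hive.map.aggr=true;                      -- map-side partial aggregation for GROUP BY
SET hive.auto.convert.join=true;             -- turn small-table joins into map joins
SET hive.exec.reducers.bytes.per.reducer=1000000000;  -- influences reducer count
```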
>>
>> Good luck
>>
>>
>>
>> On Mon, May 14, 2012 at 6:17 AM, Bhavesh Shah <bh...@gmail.com> wrote:
>>
>>> Thanks Nitin for your continuous support.
>>> *Here is my data layout, and how the queries change as needed*:
>>> 1) After importing the tables from MS SQL Server, the first basic
>>> task I do is *PIVOTING*, since SQL Server stores the data as
>>> name-value pairs.
>>> 2) Pivoting produces a subset of the data. Using this subset, we run
>>> complex queries over the history data and retrieve a result for each
>>> row in the subset.
>>>     Then the *data is updated into the pivoted columns*. (I am not
>>> using partitions; updates are done via INSERT OVERWRITE.)
>>>     As UPDATE is not supported, I have to run *INSERT OVERWRITE
>>> TABLE* again.
>>> 3) I have to repeat this about 20-30 times (depending on the
>>> business rules and the scenario).
>>> 4) After this I have to do computations with very large queries over
>>> the tables generated above.
>>>     (Each query takes about 10-11 jobs.)
>>>     This again repeats about 30 times.
>>>
>>> (All my queries contain CASE WHEN, GROUP BY, CAST functions, etc.)
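The pivoting step described above (turning name-value rows into columns) is usually done in Hive with conditional aggregation; a hypothetical sketch, with table and attribute names invented for illustration:

```sql
-- Each (uid, name, value) row becomes one column per distinct name;
-- CASE without ELSE yields NULL, which MAX ignores.
SELECT uid,
       MAX(CASE WHEN name = 'VisitDate' THEN value END) AS visit_date,
       MAX(CASE WHEN name = 'Height'    THEN value END) AS height
FROM name_value_source
GROUP BY uid;
```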
>>>
>>> --
>>> Regards,
>>> Bhavesh Shah
>>>
>>>
>>> On Mon, May 14, 2012 at 6:05 PM, Nitin Pawar <ni...@gmail.com> wrote:
>>>
>>>> Partitioning is mainly used when you want to access the table based on
>>>> the value of a particular column and don't want to go through the entire
>>>> table for the same operation. In practice, if there are a few columns
>>>> whose values repeat across all the records, you can consider
>>>> partitioning on them. Another approach is to partition the data based on
>>>> date/time, if applicable.
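The date/time partitioning suggested above looks roughly like this in HiveQL; table and column names are hypothetical:

```sql
-- Queries that filter on visit_date read only the matching partition
-- directories instead of scanning the whole table.
CREATE TABLE visits_part (uid STRING, id STRING)
PARTITIONED BY (visit_date STRING);

INSERT OVERWRITE TABLE visits_part PARTITION (visit_date='2012-05-14')
SELECT uid, id FROM visits WHERE visit_date = '2012-05-14';

SELECT COUNT(*) FROM visits_part WHERE visit_date = '2012-05-14';
```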
>>>>
>>>> From the queries you showed, I only see inserting and creating
>>>> indexes. Loading data into tables should not take much time, and I have
>>>> personally never used indexing, so I cannot comment on that particular
>>>> query's execution time.
>>>>
>>>> If I understand correctly, the following is your execution approach:
>>>>
>>>> 1) Import data from MS SQL into Hive using Sqoop. This should be over
>>>> quickly, depending on how long MS SQL takes to export.
>>>> 2) Examples of the queries you run on the data dumped into Hive would
>>>> help, so we can decide on the data layout and change the queries if
>>>> needed.
>>>> 3) Once query execution is over, you put the results back into MS SQL.
>>>>
>>>> Can you note individually how much time each step takes?
>>>>
>>>>
>>>> On Mon, May 14, 2012 at 4:38 PM, Bhavesh Shah <bh...@gmail.com> wrote:
>>>>
>>>>> Hello Nitin,
>>>>> Thanks for the suggestion about partitioning.
>>>>> One thing I forgot to mention before: *I am using indexes on all the
>>>>> tables which are used again and again.*
>>>>> But the problem is that after execution I didn't see any difference in
>>>>> performance (before applying the index versus after applying it).
>>>>> I created the indexes as below:
>>>>> sql = "CREATE INDEX INDEX_VisitDate ON TABLE Tmp (Uid, VisitDate) "
>>>>>     + "AS 'COMPACT' WITH DEFERRED REBUILD STORED AS RCFILE";
>>>>> stmt2.execute(sql);
>>>>> sql = "INSERT OVERWRITE TABLE Tmp SELECT C1.Uid, C1.VisitDate, C1.ID "
>>>>>     + "FROM TmpElementTable C1 LEFT OUTER JOIN Tmp T "
>>>>>     + "ON C1.Uid = T.Uid AND C1.VisitDate = T.VisitDate";
>>>>> stmt2.executeUpdate(sql);
>>>>> sql = "LOAD DATA INPATH '/user/hive/warehouse/tmp' "
>>>>>     + "OVERWRITE INTO TABLE TmpElementTable";
>>>>> stmt2.executeUpdate(sql);
>>>>> sql = "ALTER INDEX clinical_index ON TmpElementTable REBUILD";
>>>>> stmt2.execute(sql);
>>>>> *Did I use it correctly?*
>>>>>
>>>>> As you told me to try partitioning:
>>>>> actually, I am altering the table, which has a large number of
>>>>> columns, only at runtime.
>>>>> If I use partitioning in such a situation, is it good to partition on
>>>>> all the columns?
>>>>>
>>>>> So I want to know: after using partitioning, will the performance
>>>>> improve, or do I need to use both partitions and indexes?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Regards,
>>>>> Bhavesh Shah
>>>>>
>>>>>
>>>>> On Mon, May 14, 2012 at 3:13 PM, Nitin Pawar <ni...@gmail.com> wrote:
>>>>>
>>>>>> It is definitely possible to increase your performance.
>>>>>>
>>>>>> I have run queries in which more than 10 billion records were involved.
>>>>>> If you are doing joins in your queries, have a look at the different
>>>>>> kinds of joins supported by Hive.
>>>>>> If one of your tables is very small compared to the other, you may
>>>>>> consider a map-side join, etc.
>>>>>>
>>>>>> Also, the number of maps and reducers is decided by the split size
>>>>>> you provide to the maps.
>>>>>>
>>>>>> I would suggest that before you go full speed, you decide how you
>>>>>> want to lay out the data for Hive.
>>>>>>
>>>>>> You can try loading some data, partitioning it, and writing queries
>>>>>> against the partitions; performance will improve, but in that case
>>>>>> your queries will be in a batch-processing style. There are other
>>>>>> approaches as well.
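The map-side join and split-size points above can be sketched as follows; the table names are hypothetical and the sizes are only examples:

```sql
-- Hint a map join when small_dim fits in memory: the small table is
-- loaded into each mapper and the shuffle/reduce phase is skipped.
SELECT /*+ MAPJOIN(d) */ f.uid, d.name
FROM big_fact f JOIN small_dim d ON (f.dim_id = d.id);

-- Larger splits mean fewer map tasks; smaller splits mean more.
SET mapred.max.split.size=134217728;   -- ~128 MB upper bound per split
SET mapred.min.split.size=67108864;    -- ~64 MB lower bound
```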
>>>>>>
>>>>>>
>>>>>> On Mon, May 14, 2012 at 2:31 PM, Bhavesh Shah <
>>>>>> bhavesh25shah@gmail.com> wrote:
>>>>>>
>>>>>>> That I failed to record, namely how many maps and reducers there
>>>>>>> were, because for some reason my instance got terminated   :(
>>>>>>> I want to know one thing: if we use multiple nodes, what should the
>>>>>>> count of maps and reducers be?
>>>>>>> Actually I am confused about that. How do I decide it?
>>>>>>>
>>>>>>> Also I want to try different properties, like block size, output
>>>>>>> compression, in-memory buffer size, parallel execution, etc.
>>>>>>> Do all these properties matter for increasing the performance?
>>>>>>>
>>>>>>> Nitin, you have read my whole use case. Is what I did to implement
>>>>>>> it with the help of Hadoop correct?
>>>>>>> Is it possible to increase the performance?
>>>>>>>
>>>>>>> Thanks Nitin for your reply.   :)
>>>>>>>
>>>>>>> --
>>>>>>> Regards,
>>>>>>> Bhavesh Shah
>>>>>>>
>>>>>>>
>>>>>>> On Mon, May 14, 2012 at 2:07 PM, Nitin Pawar <
>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>
>>>>>>>> with a 10 node cluster the performance should improve.
>>>>>>>> how many maps and reducers are being launched?
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, May 14, 2012 at 1:18 PM, Bhavesh Shah <
>>>>>>>> bhavesh25shah@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> I have about 1 billion records in my relational database.
>>>>>>>>> Currently I am using just a single local node. But I also tried
>>>>>>>>> this on Amazon Elastic MapReduce with 10 nodes, and the time taken to
>>>>>>>>> execute the complete program was the same as on my single local machine.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, May 14, 2012 at 1:13 PM, Nitin Pawar <
>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>
 
>>>>>>>>>> How many records?
>>>>>>>>>>
>>>>>>>>>> What is your Hadoop cluster setup? How many nodes?
>>>>>>>>>> If you are running Hadoop in a single-node setup on a normal
>>>>>>>>>> desktop, I doubt it will be of much help.
>>>>>>>>>>
>>>>>>>>>> You need a stronger cluster setup for better query runtimes, and
>>>>>>>>>> of course query optimization, which I guess you have already taken
>>>>>>>>>> care of.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Mon, May 14, 2012 at 12:39 PM, Bhavesh Shah <
>>>>>>>>>> bhavesh25shah@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hello all,
>>>>>>>>>>> My Use Case is:
>>>>>>>>>>> 1) I have a relational database which has a very large data. (MS
>>>>>>>>>>> SQL Server)
>>>>>>>>>>> 2) I want to do analysis on these huge data  and want to
>>>>>>>>>>> generate reports
>>>>>>>>>>> on it after analysis.
>>>>>>>>>>> Like this I have to generate various reports based on different
>>>>>>>>>>> analysis.
>>>>>>>>>>>
>>>>>>>>>>> I tried to implement this using Hive. What I did is:
>>>>>>>>>>> 1) I imported all tables in Hive from MS SQL Server using SQOOP.
>>>>>>>>>>> 2) I wrote many queries in Hive which is executing using JDBC on
>>>>>>>>>>> Hive
>>>>>>>>>>> Thrift Server
>>>>>>>>>>> 3) I am getting the correct result in table form, which I am
>>>>>>>>>>> expecting
>>>>>>>>>>> 4) But the problem is that the time which require to execute is
>>>>>>>>>>> too much
>>>>>>>>>>> long.
>>>>>>>>>>>    (My complete program is executing in near about 3-4 hours on
>>>>>>>>>>> *small
>>>>>>>>>>> amount of data*).
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>    I decided to do this using Hive.
>>>>>>>>>>>     And as I told previously how much time Hive consumed for
>>>>>>>>>>> execution. my
>>>>>>>>>>> organization is expecting to complete this task in near about
>>>>>>>>>>> less than
>>>>>>>>>>> 1/2 hours
>>>>>>>>>>>
>>>>>>>>>>> Now after spending too much time for complete execution for this
>>>>>>>>>>> task what
>>>>>>>>>>> should I do?
>>>>>>>>>>> I want to ask one thing that:
>>>>>>>>>>> *Is this Use Case is possible with Hive?* If possible what
>>>>>>>>>>> should I do in
>>>>>>>>>>>
>>>>>>>>>>> my program to increase the performance?
>>>>>>>>>>> *And If not possible what is the other good way to implement
>>>>>>>>>>> this Use Case?*
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Please reply me.
>>>>>>>>>>> Thanks
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Regards,
>>>>>>>>>>> Bhavesh Shah
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Nitin Pawar
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Regards,
>>>>>>>>> Bhavesh Shah
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Nitin Pawar
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Nitin Pawar
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Nitin Pawar
>>>>
>>>>
>>>
>>>
>>>
>>>
>>
>
>
>
>


-- 
Nitin Pawar

>>>>>>>>> Use Case?*
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Please reply me.
>>>>>>>>> Thanks
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Regards,
>>>>>>>>> Bhavesh Shah
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Nitin Pawar
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Regards,
>>>>>>> Bhavesh Shah
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Nitin Pawar
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Nitin Pawar
>>>>
>>>>
>>>
>>>
>>>
>>>
>>
>>
>> --
>> Nitin Pawar
>>
>>
>
>
>
>

Re: Is my Use Case possible with Hive?

Posted by Bhavesh Shah <bh...@gmail.com>.
Thanks Nitin for your continuous support.
*Here is my data layout, and how the queries change as needed*:
1) Initially, after importing the tables from MS SQL Server, the first basic
task I do is *PIVOTING*, since SQL Server stores the data as name-value
pairs.
2) Pivoting produces a subset of data. Using this subset we run complex
queries against the history data and retrieve a result for each row in the
subset; the *data is then written back into the pivoted columns*. (I am not
using partitions; updates are done with INSERT OVERWRITE.)
    As UPDATE is not supported, I have to run *INSERT OVERWRITE TABLE*
again.
3) I repeat this roughly 20-30 times (depending on the business rules and
the scenario).
4) After this I do the computation, which uses very large queries over the
tables generated above.
    (Each query spawns about 10-11 jobs.)
    This again repeats 30 times.

(All my queries contain CASE WHEN, GROUP BY, CAST functions, etc.)
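As a rough sketch of that pivoting step (the table and column names here are made up for illustration, not taken from my schema), a single conditional-aggregation pass can collapse many name-value rows into one wide row, instead of one INSERT OVERWRITE pass per attribute:

```sql
-- Hypothetical name-value source table: Observations(Uid, VisitDate, Name, Value)
-- One GROUP BY pass pivots the attribute rows into one wide row per visit.
INSERT OVERWRITE TABLE PivotedVisits
SELECT
  Uid,
  VisitDate,
  MAX(CASE WHEN Name = 'weight'      THEN Value END) AS weight,
  MAX(CASE WHEN Name = 'height'      THEN Value END) AS height,
  MAX(CASE WHEN Name = 'bp_systolic' THEN Value END) AS bp_systolic
FROM Observations
GROUP BY Uid, VisitDate;
```

If something like this fits, it would replace several of the repeated INSERT OVERWRITE passes with one job.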

-- 
Regards,
Bhavesh Shah


On Mon, May 14, 2012 at 6:05 PM, Nitin Pawar <ni...@gmail.com>wrote:

> partitioning is mainly used when you want to access the table based on
> value of a particular column and dont want to go through entire table for
> same operation. This actually means if there are few columns whose values
> are repeated in all the records, then you can consider partitioning on
> them. Other approach will be partition data based on date/time if
> applicable.
>
> From the queries you showed, i am just seeing inserting and creating
> indexes. loading data to tables should not take much time and I personally
> have never used indexing so can not tell about that particular query
> execution time.
>
> if I understand correctly following is your execution approach
>
> 1) Import data from MS-SQL to hive using sqoop
>     should be over quickly depending on how much time MS-SQL takes to
> export
> 2) example of queries which you are doing on the data being dumped in hive
> will be good to know if we can decide on the data layout and change the
> queries as per needed if needed
> 3) Once query execution is over you are putting the result back in MS-SQL
>
> can you note individually how much time each step is taking?
>
>
> On Mon, May 14, 2012 at 4:38 PM, Bhavesh Shah <bh...@gmail.com>wrote:
>
>> Hello Nitin,
>> Thanks for suggesting me about the partition.
>> But I want to tell one thing that I forgot to mention before is that :*
>> I am using Indexes on all tables tables which are used again and again. *
>> But the problem is that after execution I didn't see the difference in
>> performance (before applying the index and after applying it)
>> I have created the indexes as below:
>> sql = "CREATE INDEX INDEX_VisitDate ON TABLE Tmp(Uid,VisitDate) as
>> 'COMPACT' WITH DEFERRED REBUILD stored as RCFILE";
>> res2 = stmt2.executeQuery(sql);
>> sql = (new StringBuilder(" INSERT OVERWRITE TABLE Tmp  select C1.Uid,
>> C1.VisitDate, C1.ID from
>>        TmpElementTable C1 LEFT OUTER JOIN Tmp T on C1.Uid=T.Uid and
>> C1.VisitDate=T.VisitDate").toString();
>> stmt2.executeUpdate(sql);
>> sql = "load data inpath '/user/hive/warehouse/tmp' overwrite into table
>> TmpElementTable";
>> stmt2.executeUpdate(sql);
>> sql = "alter index clinical_index on TmpElementTable REBUILD";
>> res2 = stmt2.executeQuery(sql);
>> *Did I use it in correct way?*
>>
>> As you told me told me to try with partition
>> Actually I am altering the table with large number of columns at the
>> runtime only.
>> If i use partition in such situation then is it good to use partition for
>> all columns?
>>
>> So, I want to know that After using the partition Will it be able to
>> improve the performance or
>> do I need to use both Partition and Indexes?
>>
>>
>>
>>
>> --
>> Regards,
>> Bhavesh Shah
>>
>>
>> On Mon, May 14, 2012 at 3:13 PM, Nitin Pawar <ni...@gmail.com>wrote:
>>
>>> it is definitely possible to increase your performance.
>>>
>>> I have run queries where more than 10 billion records were involved.
>>> If you are doing joins in your queries, you may have a look at different
>>> kind of joins supported by hive.
>>> If one of your table is very small in size compared to another table
>>> then you may consider mapside join etc
>>>
>>> Also the number of maps and reducers are decided by the split size you
>>> provide to maps.
>>>
>>> I would suggest before you go full speed, decide on how you want to
>>> layout data for hive.
>>>
>>> You can try loading some data, partition the data and write queries
>>> based on partition then performance will improve but in that case your
>>> queries will be in batch processing format. there are other approaches as
>>> well.
>>>
>>>
>>> On Mon, May 14, 2012 at 2:31 PM, Bhavesh Shah <bh...@gmail.com>wrote:
>>>
>>>> That I fail to know, how many maps and reducers are there. Because due
>>>> to some reason my instance get terminated   :(
>>>> I want to know one thing that If we use multiple nodes, then what
>>>> should be the count of maps and reducers.
>>>> Actually I am confused about that. How to decide it?
>>>>
>>>> Also I want to try the different properties like block size, compress
>>>> output, size of in-memorybuffer, parallel execution etc.
>>>> Will these all properties matters to increase the performance?
>>>>
>>>> Nitin, you have read all my use case. Whatever the thing I did to
>>>> implement with the help of Hadoop is correct?
>>>> Is it possible to increase the performance?
>>>>
>>>> Thanks Nitin for your reply.   :)
>>>>
>>>> --
>>>> Regards,
>>>> Bhavesh Shah
>>>>
>>>>
>>>> On Mon, May 14, 2012 at 2:07 PM, Nitin Pawar <ni...@gmail.com>wrote:
>>>>
>>>>> with a 10 node cluster the performance should improve.
>>>>> how many maps and reducers are being launched?
>>>>>
>>>>>
>>>>> On Mon, May 14, 2012 at 1:18 PM, Bhavesh Shah <bhavesh25shah@gmail.com
>>>>> > wrote:
>>>>>
>>>>>> I have near about 1 billion records in my relational database.
>>>>>> Currently locally I am using just one cluster. But I also tried this
>>>>>> on Amazon Elastic Mapreduce with 10 nodes. But the time taken to execute
>>>>>> the complete program is same as that on my  single local machine.
>>>>>>
>>>>>>
>>>>>> On Mon, May 14, 2012 at 1:13 PM, Nitin Pawar <nitinpawar432@gmail.com
>>>>>> > wrote:
>>>>>>
>>>>>>> how many # records?
>>>>>>>
>>>>>>> what is your hadoop cluster setup? how many nodes?
>>>>>>> if you are running hadoop on a single node setup with normal
>>>>>>> desktop, i doubt it will be of any help.
>>>>>>>
>>>>>>> You need a stronger cluster setup for better query runtimes and
>>>>>>> ofcourse query optimization which I guess you would have already taken care.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Mon, May 14, 2012 at 12:39 PM, Bhavesh Shah <
>>>>>>> bhavesh25shah@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hello all,
>>>>>>>> My Use Case is:
>>>>>>>> 1) I have a relational database which has a very large data. (MS
>>>>>>>> SQL Server)
>>>>>>>> 2) I want to do analysis on these huge data  and want to generate
>>>>>>>> reports
>>>>>>>> on it after analysis.
>>>>>>>> Like this I have to generate various reports based on different
>>>>>>>> analysis.
>>>>>>>>
>>>>>>>> I tried to implement this using Hive. What I did is:
>>>>>>>> 1) I imported all tables in Hive from MS SQL Server using SQOOP.
>>>>>>>> 2) I wrote many queries in Hive which is executing using JDBC on
>>>>>>>> Hive
>>>>>>>> Thrift Server
>>>>>>>> 3) I am getting the correct result in table form, which I am
>>>>>>>> expecting
>>>>>>>> 4) But the problem is that the time which require to execute is too
>>>>>>>> much
>>>>>>>> long.
>>>>>>>>    (My complete program is executing in near about 3-4 hours on
>>>>>>>> *small
>>>>>>>> amount of data*).
>>>>>>>>
>>>>>>>>
>>>>>>>>    I decided to do this using Hive.
>>>>>>>>     And as I told previously how much time Hive consumed for
>>>>>>>> execution. my
>>>>>>>> organization is expecting to complete this task in near about less
>>>>>>>> than
>>>>>>>> 1/2 hours
>>>>>>>>
>>>>>>>> Now after spending too much time for complete execution for this
>>>>>>>> task what
>>>>>>>> should I do?
>>>>>>>> I want to ask one thing that:
>>>>>>>> *Is this Use Case is possible with Hive?* If possible what should I
>>>>>>>> do in
>>>>>>>>
>>>>>>>> my program to increase the performance?
>>>>>>>> *And If not possible what is the other good way to implement this
>>>>>>>> Use Case?*
>>>>>>>>
>>>>>>>>
>>>>>>>> Please reply me.
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Regards,
>>>>>>>> Bhavesh Shah
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Nitin Pawar
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Regards,
>>>>>> Bhavesh Shah
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Nitin Pawar
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Nitin Pawar
>>>
>>>
>>
>>
>>
>>
>
>
> --
> Nitin Pawar
>
>

Re: Is my Use Case possible with Hive?

Posted by Nitin Pawar <ni...@gmail.com>.
partitioning is mainly used when you want to access the table based on the
value of a particular column and don't want to go through the entire table
for the same operation. In practice, if there are a few columns whose values
repeat across all the records, you can consider partitioning on them.
Another approach is to partition the data based on date/time, if applicable.

From the queries you showed, I only see inserts and index creation. Loading
data into tables should not take much time, and I personally have never used
indexing, so I can't comment on that particular query's execution time.

If I understand correctly, your execution approach is:

1) Import data from MS-SQL to Hive using Sqoop
    (should be over quickly, depending on how much time MS-SQL takes to
export)
2) Example queries that you run on the data dumped into Hive would help us
decide on the data layout and change the queries if needed
3) Once query execution is over, you put the result back into MS-SQL

Can you note individually how much time each step takes?
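As an illustration of the date-based partitioning suggested above (table and
column names are hypothetical, not from the original schema), a partitioned
layout in Hive looks roughly like this:

```sql
-- Partition the visit data by date so a query touching one day
-- scans only that day's files instead of the whole table.
CREATE TABLE visits_part (
  Uid   STRING,
  Name  STRING,
  Value STRING
)
PARTITIONED BY (VisitDate STRING)
STORED AS RCFILE;

-- Dynamic-partition load from the unpartitioned import
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE visits_part PARTITION (VisitDate)
SELECT Uid, Name, Value, VisitDate FROM visits_raw;

-- A predicate on the partition column prunes all other partitions
SELECT COUNT(*) FROM visits_part WHERE VisitDate = '2012-05-01';
```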

On Mon, May 14, 2012 at 4:38 PM, Bhavesh Shah <bh...@gmail.com>wrote:

> Hello Nitin,
> Thanks for suggesting me about the partition.
> But I want to tell one thing that I forgot to mention before is that :*
> I am using Indexes on all tables tables which are used again and again. *
> But the problem is that after execution I didn't see the difference in
> performance (before applying the index and after applying it)
> I have created the indexes as below:
> sql = "CREATE INDEX INDEX_VisitDate ON TABLE Tmp(Uid,VisitDate) as
> 'COMPACT' WITH DEFERRED REBUILD stored as RCFILE";
> res2 = stmt2.executeQuery(sql);
> sql = (new StringBuilder(" INSERT OVERWRITE TABLE Tmp  select C1.Uid,
> C1.VisitDate, C1.ID from
>        TmpElementTable C1 LEFT OUTER JOIN Tmp T on C1.Uid=T.Uid and
> C1.VisitDate=T.VisitDate").toString();
> stmt2.executeUpdate(sql);
> sql = "load data inpath '/user/hive/warehouse/tmp' overwrite into table
> TmpElementTable";
> stmt2.executeUpdate(sql);
> sql = "alter index clinical_index on TmpElementTable REBUILD";
> res2 = stmt2.executeQuery(sql);
> *Did I use it in correct way?*
>
> As you told me told me to try with partition
> Actually I am altering the table with large number of columns at the
> runtime only.
> If i use partition in such situation then is it good to use partition for
> all columns?
>
> So, I want to know that After using the partition Will it be able to
> improve the performance or
> do I need to use both Partition and Indexes?
>
>
>
>
> --
> Regards,
> Bhavesh Shah
>
>
> On Mon, May 14, 2012 at 3:13 PM, Nitin Pawar <ni...@gmail.com>wrote:
>
>> it is definitely possible to increase your performance.
>>
>> I have run queries where more than 10 billion records were involved.
>> If you are doing joins in your queries, you may have a look at different
>> kind of joins supported by hive.
>> If one of your table is very small in size compared to another table then
>> you may consider mapside join etc
>>
>> Also the number of maps and reducers are decided by the split size you
>> provide to maps.
>>
>> I would suggest before you go full speed, decide on how you want to
>> layout data for hive.
>>
>> You can try loading some data, partition the data and write queries based
>> on partition then performance will improve but in that case your queries
>> will be in batch processing format. there are other approaches as well.
>>
>>
>> On Mon, May 14, 2012 at 2:31 PM, Bhavesh Shah <bh...@gmail.com>wrote:
>>
>>> That I fail to know, how many maps and reducers are there. Because due
>>> to some reason my instance get terminated   :(
>>> I want to know one thing that If we use multiple nodes, then what should
>>> be the count of maps and reducers.
>>> Actually I am confused about that. How to decide it?
>>>
>>> Also I want to try the different properties like block size, compress
>>> output, size of in-memorybuffer, parallel execution etc.
>>> Will these all properties matters to increase the performance?
>>>
>>> Nitin, you have read all my use case. Whatever the thing I did to
>>> implement with the help of Hadoop is correct?
>>> Is it possible to increase the performance?
>>>
>>> Thanks Nitin for your reply.   :)
>>>
>>> --
>>> Regards,
>>> Bhavesh Shah
>>>
>>>
>>> On Mon, May 14, 2012 at 2:07 PM, Nitin Pawar <ni...@gmail.com>wrote:
>>>
>>>> with a 10 node cluster the performance should improve.
>>>> how many maps and reducers are being launched?
>>>>
>>>>
>>>> On Mon, May 14, 2012 at 1:18 PM, Bhavesh Shah <bh...@gmail.com>wrote:
>>>>
>>>>> I have near about 1 billion records in my relational database.
>>>>> Currently locally I am using just one cluster. But I also tried this
>>>>> on Amazon Elastic Mapreduce with 10 nodes. But the time taken to execute
>>>>> the complete program is same as that on my  single local machine.
>>>>>
>>>>>
>>>>> On Mon, May 14, 2012 at 1:13 PM, Nitin Pawar <ni...@gmail.com>wrote:
>>>>>
>>>>>> how many # records?
>>>>>>
>>>>>> what is your hadoop cluster setup? how many nodes?
>>>>>> if you are running hadoop on a single node setup with normal desktop,
>>>>>> i doubt it will be of any help.
>>>>>>
>>>>>> You need a stronger cluster setup for better query runtimes and
>>>>>> ofcourse query optimization which I guess you would have already taken care.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, May 14, 2012 at 12:39 PM, Bhavesh Shah <
>>>>>> bhavesh25shah@gmail.com> wrote:
>>>>>>
>>>>>>> Hello all,
>>>>>>> My Use Case is:
>>>>>>> 1) I have a relational database which has a very large data. (MS SQL
>>>>>>> Server)
>>>>>>> 2) I want to do analysis on these huge data  and want to generate
>>>>>>> reports
>>>>>>> on it after analysis.
>>>>>>> Like this I have to generate various reports based on different
>>>>>>> analysis.
>>>>>>>
>>>>>>> I tried to implement this using Hive. What I did is:
>>>>>>> 1) I imported all tables in Hive from MS SQL Server using SQOOP.
>>>>>>> 2) I wrote many queries in Hive which is executing using JDBC on Hive
>>>>>>> Thrift Server
>>>>>>> 3) I am getting the correct result in table form, which I am
>>>>>>> expecting
>>>>>>> 4) But the problem is that the time which require to execute is too
>>>>>>> much
>>>>>>> long.
>>>>>>>    (My complete program is executing in near about 3-4 hours on
>>>>>>> *small
>>>>>>> amount of data*).
>>>>>>>
>>>>>>>
>>>>>>>    I decided to do this using Hive.
>>>>>>>     And as I told previously how much time Hive consumed for
>>>>>>> execution. my
>>>>>>> organization is expecting to complete this task in near about less
>>>>>>> than
>>>>>>> 1/2 hours
>>>>>>>
>>>>>>> Now after spending too much time for complete execution for this
>>>>>>> task what
>>>>>>> should I do?
>>>>>>> I want to ask one thing that:
>>>>>>> *Is this Use Case is possible with Hive?* If possible what should I
>>>>>>> do in
>>>>>>>
>>>>>>> my program to increase the performance?
>>>>>>> *And If not possible what is the other good way to implement this
>>>>>>> Use Case?*
>>>>>>>
>>>>>>>
>>>>>>> Please reply me.
>>>>>>> Thanks
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Regards,
>>>>>>> Bhavesh Shah
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Nitin Pawar
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Regards,
>>>>> Bhavesh Shah
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Nitin Pawar
>>>>
>>>>
>>>
>>>
>>>
>>>
>>
>>
>> --
>> Nitin Pawar
>>
>>
>
>
>
>


-- 
Nitin Pawar

Re: Is my Use Case possible with Hive?

Posted by Bhavesh Shah <bh...@gmail.com>.
Hello Nitin,
Thanks for suggesting the partitions.
But one thing I forgot to mention before: *I am using indexes on all the
tables which are used again and again.* The problem is that after execution
I didn't see any difference in performance (before applying the index and
after applying it).
I have created the indexes as below:

sql = "CREATE INDEX INDEX_VisitDate ON TABLE Tmp (Uid, VisitDate) "
        + "AS 'COMPACT' WITH DEFERRED REBUILD STORED AS RCFILE";
res2 = stmt2.executeQuery(sql);
sql = "INSERT OVERWRITE TABLE Tmp SELECT C1.Uid, C1.VisitDate, C1.ID "
        + "FROM TmpElementTable C1 LEFT OUTER JOIN Tmp T "
        + "ON C1.Uid = T.Uid AND C1.VisitDate = T.VisitDate";
stmt2.executeUpdate(sql);
sql = "LOAD DATA INPATH '/user/hive/warehouse/tmp' "
        + "OVERWRITE INTO TABLE TmpElementTable";
stmt2.executeUpdate(sql);
sql = "ALTER INDEX clinical_index ON TmpElementTable REBUILD";
res2 = stmt2.executeQuery(sql);

*Did I use it in the correct way?*

As you told me to try partitions: I am actually altering the table, adding
a large number of columns, at runtime only. If I use partitions in such a
situation, is it good to partition on all columns?

So I want to know: after using partitions, will the performance improve, or
do I need to use both partitions and indexes?




-- 
Regards,
Bhavesh Shah


On Mon, May 14, 2012 at 3:13 PM, Nitin Pawar <ni...@gmail.com>wrote:

> it is definitely possible to increase your performance.
>
> I have run queries where more than 10 billion records were involved.
> If you are doing joins in your queries, you may have a look at different
> kind of joins supported by hive.
> If one of your table is very small in size compared to another table then
> you may consider mapside join etc
>
> Also the number of maps and reducers are decided by the split size you
> provide to maps.
>
> I would suggest before you go full speed, decide on how you want to layout
> data for hive.
>
> You can try loading some data, partition the data and write queries based
> on partition then performance will improve but in that case your queries
> will be in batch processing format. there are other approaches as well.
>
>
> On Mon, May 14, 2012 at 2:31 PM, Bhavesh Shah <bh...@gmail.com>wrote:
>
>> That I fail to know, how many maps and reducers are there. Because due to
>> some reason my instance get terminated   :(
>> I want to know one thing that If we use multiple nodes, then what should
>> be the count of maps and reducers.
>> Actually I am confused about that. How to decide it?
>>
>> Also I want to try the different properties like block size, compress
>> output, size of in-memorybuffer, parallel execution etc.
>> Will these all properties matters to increase the performance?
>>
>> Nitin, you have read all my use case. Whatever the thing I did to
>> implement with the help of Hadoop is correct?
>> Is it possible to increase the performance?
>>
>> Thanks Nitin for your reply.   :)
>>
>> --
>> Regards,
>> Bhavesh Shah
>>
>>
>> On Mon, May 14, 2012 at 2:07 PM, Nitin Pawar <ni...@gmail.com>wrote:
>>
>>> with a 10 node cluster the performance should improve.
>>> how many maps and reducers are being launched?
>>>
>>>
>>> On Mon, May 14, 2012 at 1:18 PM, Bhavesh Shah <bh...@gmail.com>wrote:
>>>
>>>> I have near about 1 billion records in my relational database.
>>>> Currently locally I am using just one cluster. But I also tried this on
>>>> Amazon Elastic Mapreduce with 10 nodes. But the time taken to execute the
>>>> complete program is same as that on my  single local machine.
>>>>
>>>>
>>>> On Mon, May 14, 2012 at 1:13 PM, Nitin Pawar <ni...@gmail.com>wrote:
>>>>
>>>>> how many # records?
>>>>>
>>>>> what is your hadoop cluster setup? how many nodes?
>>>>> if you are running hadoop on a single node setup with normal desktop,
>>>>> i doubt it will be of any help.
>>>>>
>>>>> You need a stronger cluster setup for better query runtimes and
>>>>> ofcourse query optimization which I guess you would have already taken care.
>>>>>
>>>>>
>>>>>
>>>>> On Mon, May 14, 2012 at 12:39 PM, Bhavesh Shah <
>>>>> bhavesh25shah@gmail.com> wrote:
>>>>>
>>>>>> Hello all,
>>>>>> My Use Case is:
>>>>>> 1) I have a relational database which has a very large data. (MS SQL
>>>>>> Server)
>>>>>> 2) I want to do analysis on these huge data  and want to generate
>>>>>> reports
>>>>>> on it after analysis.
>>>>>> Like this I have to generate various reports based on different
>>>>>> analysis.
>>>>>>
>>>>>> I tried to implement this using Hive. What I did is:
>>>>>> 1) I imported all tables in Hive from MS SQL Server using SQOOP.
>>>>>> 2) I wrote many queries in Hive which is executing using JDBC on Hive
>>>>>> Thrift Server
>>>>>> 3) I am getting the correct result in table form, which I am expecting
>>>>>> 4) But the problem is that the time which require to execute is too
>>>>>> much
>>>>>> long.
>>>>>>    (My complete program is executing in near about 3-4 hours on *small
>>>>>> amount of data*).
>>>>>>
>>>>>>
>>>>>>    I decided to do this using Hive.
>>>>>>     And as I told previously how much time Hive consumed for
>>>>>> execution. my
>>>>>> organization is expecting to complete this task in near about less
>>>>>> than
>>>>>> 1/2 hours
>>>>>>
>>>>>> Now after spending too much time for complete execution for this task
>>>>>> what
>>>>>> should I do?
>>>>>> I want to ask one thing that:
>>>>>> *Is this Use Case is possible with Hive?* If possible what should I
>>>>>> do in
>>>>>>
>>>>>> my program to increase the performance?
>>>>>> *And If not possible what is the other good way to implement this Use
>>>>>> Case?*
>>>>>>
>>>>>>
>>>>>> Please reply me.
>>>>>> Thanks
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Regards,
>>>>>> Bhavesh Shah
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Nitin Pawar
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Regards,
>>>> Bhavesh Shah
>>>>
>>>>
>>>
>>>
>>> --
>>> Nitin Pawar
>>>
>>>
>>
>>
>>
>>
>
>
> --
> Nitin Pawar
>
>

Re: Is my Use Case possible with Hive?

Posted by Justin Coffey <jq...@gmail.com>.
You can also have a reduce-side bottleneck if, for example, you are doing
distinct counts or have skewed group sizes (i.e., one aggregation group is
much larger than the others).

But to know this you really need to look at the stats of your jobs via the
JobTracker, and even at the progress counter output of Hive.

If you do see one reducer getting all the work, you can try setting
hive.groupby.skewindata=true.
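For example (query and table names are hypothetical), a skewed distinct
count could be run as:

```sql
-- Splits a skewed GROUP BY into two MR jobs: the first distributes
-- rows randomly across reducers to pre-aggregate, the second merges,
-- so no single reducer receives the whole hot group.
SET hive.groupby.skewindata=true;

SELECT Uid, COUNT(DISTINCT VisitDate)
FROM visits
GROUP BY Uid;
```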

-Justin

On Mon, May 14, 2012 at 11:43 AM, Nitin Pawar <ni...@gmail.com>wrote:

> it is definitely possible to increase your performance.
>
> I have run queries where more than 10 billion records were involved.
> If you are doing joins in your queries, you may have a look at different
> kind of joins supported by hive.
> If one of your table is very small in size compared to another table then
> you may consider mapside join etc
>
> Also the number of maps and reducers are decided by the split size you
> provide to maps.
>
> I would suggest before you go full speed, decide on how you want to layout
> data for hive.
>
> You can try loading some data, partition the data and write queries based
> on partition then performance will improve but in that case your queries
> will be in batch processing format. there are other approaches as well.
>
>
> On Mon, May 14, 2012 at 2:31 PM, Bhavesh Shah <bh...@gmail.com>wrote:
>
>> That I fail to know, how many maps and reducers are there. Because due to
>> some reason my instance get terminated   :(
>> I want to know one thing that If we use multiple nodes, then what should
>> be the count of maps and reducers.
>> Actually I am confused about that. How to decide it?
>>
>> Also I want to try the different properties like block size, compress
>> output, size of in-memorybuffer, parallel execution etc.
>> Will these all properties matters to increase the performance?
>>
>> Nitin, you have read all my use case. Whatever the thing I did to
>> implement with the help of Hadoop is correct?
>> Is it possible to increase the performance?
>>
>> Thanks Nitin for your reply.   :)
>>
>> --
>> Regards,
>> Bhavesh Shah
>>
>>
>> On Mon, May 14, 2012 at 2:07 PM, Nitin Pawar <ni...@gmail.com>wrote:
>>
>>> with a 10 node cluster the performance should improve.
>>> how many maps and reducers are being launched?
>>>
>>>
>>> On Mon, May 14, 2012 at 1:18 PM, Bhavesh Shah <bh...@gmail.com>wrote:
>>>
>>>> I have near about 1 billion records in my relational database.
>>>> Currently locally I am using just one cluster. But I also tried this on
>>>> Amazon Elastic Mapreduce with 10 nodes. But the time taken to execute the
>>>> complete program is same as that on my  single local machine.
>>>>
>>>>
>>>> On Mon, May 14, 2012 at 1:13 PM, Nitin Pawar <ni...@gmail.com>wrote:
>>>>
>>>>> how many # records?
>>>>>
>>>>> what is your hadoop cluster setup? how many nodes?
>>>>> if you are running hadoop on a single node setup with normal desktop,
>>>>> i doubt it will be of any help.
>>>>>
>>>>> You need a stronger cluster setup for better query runtimes and
>>>>> ofcourse query optimization which I guess you would have already taken care.
>>>>>
>>>>>
>>>>>
>>>>> On Mon, May 14, 2012 at 12:39 PM, Bhavesh Shah <
>>>>> bhavesh25shah@gmail.com> wrote:
>>>>>
>>>>>> Hello all,
>>>>>> My Use Case is:
>>>>>> 1) I have a relational database which has a very large data. (MS SQL
>>>>>> Server)
>>>>>> 2) I want to do analysis on these huge data  and want to generate
>>>>>> reports
>>>>>> on it after analysis.
>>>>>> Like this I have to generate various reports based on different
>>>>>> analysis.
>>>>>>
>>>>>> I tried to implement this using Hive. What I did is:
>>>>>> 1) I imported all tables in Hive from MS SQL Server using SQOOP.
>>>>>> 2) I wrote many queries in Hive which is executing using JDBC on Hive
>>>>>> Thrift Server
>>>>>> 3) I am getting the correct result in table form, which I am expecting
>>>>>> 4) But the problem is that the time which require to execute is too
>>>>>> much
>>>>>> long.
>>>>>>    (My complete program is executing in near about 3-4 hours on *small
>>>>>> amount of data*).
>>>>>>
>>>>>>
>>>>>>    I decided to do this using Hive.
>>>>>>     And as I told previously how much time Hive consumed for
>>>>>> execution. my
>>>>>> organization is expecting to complete this task in near about less
>>>>>> than
>>>>>> 1/2 hours
>>>>>>
>>>>>> Now after spending too much time for complete execution for this task
>>>>>> what
>>>>>> should I do?
>>>>>> I want to ask one thing that:
>>>>>> *Is this Use Case is possible with Hive?* If possible what should I
>>>>>> do in
>>>>>>
>>>>>> my program to increase the performance?
>>>>>> *And If not possible what is the other good way to implement this Use
>>>>>> Case?*
>>>>>>
>>>>>>
>>>>>> Please reply me.
>>>>>> Thanks
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Regards,
>>>>>> Bhavesh Shah
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Nitin Pawar
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Regards,
>>>> Bhavesh Shah
>>>>
>>>>
>>>
>>>
>>> --
>>> Nitin Pawar
>>>
>>>
>>
>>
>>
>>
>
>
> --
> Nitin Pawar
>
>


-- 
jqcoffey@gmail.com
-----

Re: Is my Use Case possible with Hive?

Posted by Nitin Pawar <ni...@gmail.com>.
it is definitely possible to increase your performance.

I have run queries where more than 10 billion records were involved.
If you are doing joins in your queries, have a look at the different kinds
of joins supported by Hive: if one of your tables is very small compared to
the other, you may consider a map-side join, etc.

Also, the number of maps and reducers is decided by the split size you
provide to the maps.

I would suggest that before you go full speed, you decide on how you want
to lay out the data for Hive.

You can try loading some data, partitioning it, and writing queries based
on the partitions; performance will then improve, but in that case your
queries will be in batch-processing format. There are other approaches as
well.
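A sketch of the map-side join mentioned above (table names are
hypothetical): in Hive of this era the hint form was

```sql
-- The MAPJOIN hint loads the small dimension table into memory on
-- every mapper, so the join completes with no reduce phase at all.
SELECT /*+ MAPJOIN(d) */ f.Uid, f.Value, d.Description
FROM big_fact f
JOIN small_dim d ON f.code = d.code;

-- Hive can also convert qualifying joins to map joins automatically:
SET hive.auto.convert.join=true;
```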


On Mon, May 14, 2012 at 2:31 PM, Bhavesh Shah <bh...@gmail.com>wrote:

> That I fail to know, how many maps and reducers are there. Because due to
> some reason my instance get terminated   :(
> I want to know one thing that If we use multiple nodes, then what should
> be the count of maps and reducers.
> Actually I am confused about that. How to decide it?
>
> Also I want to try the different properties like block size, compress
> output, size of in-memorybuffer, parallel execution etc.
> Will these all properties matters to increase the performance?
>
> Nitin, you have read all my use case. Whatever the thing I did to
> implement with the help of Hadoop is correct?
> Is it possible to increase the performance?
>
> Thanks Nitin for your reply.   :)
>
> --
> Regards,
> Bhavesh Shah
>
>
> On Mon, May 14, 2012 at 2:07 PM, Nitin Pawar <ni...@gmail.com>wrote:
>
>> with a 10 node cluster the performance should improve.
>> how many maps and reducers are being launched?
>>
>>
>> On Mon, May 14, 2012 at 1:18 PM, Bhavesh Shah <bh...@gmail.com>wrote:
>>
>>> I have near about 1 billion records in my relational database.
>>> Currently locally I am using just one cluster. But I also tried this on
>>> Amazon Elastic Mapreduce with 10 nodes. But the time taken to execute the
>>> complete program is same as that on my  single local machine.
>>>
>>>
>>> On Mon, May 14, 2012 at 1:13 PM, Nitin Pawar <ni...@gmail.com>wrote:
>>>
>>>> how many # records?
>>>>
>>>> what is your hadoop cluster setup? how many nodes?
>>>> if you are running hadoop on a single node setup with normal desktop, i
>>>> doubt it will be of any help.
>>>>
>>>> You need a stronger cluster setup for better query runtimes and
>>>> ofcourse query optimization which I guess you would have already taken care.
>>>>
>>>>
>>>>
>>>> On Mon, May 14, 2012 at 12:39 PM, Bhavesh Shah <bhavesh25shah@gmail.com
>>>> > wrote:
>>>>
>>>>> Hello all,
>>>>> My Use Case is:
>>>>> 1) I have a relational database which has a very large data. (MS SQL
>>>>> Server)
>>>>> 2) I want to do analysis on these huge data  and want to generate
>>>>> reports
>>>>> on it after analysis.
>>>>> Like this I have to generate various reports based on different
>>>>> analysis.
>>>>>
>>>>> I tried to implement this using Hive. What I did is:
>>>>> 1) I imported all tables in Hive from MS SQL Server using SQOOP.
>>>>> 2) I wrote many queries in Hive which is executing using JDBC on Hive
>>>>> Thrift Server
>>>>> 3) I am getting the correct result in table form, which I am expecting
>>>>> 4) But the problem is that the time which require to execute is too
>>>>> much
>>>>> long.
>>>>>    (My complete program is executing in near about 3-4 hours on *small
>>>>> amount of data*).
>>>>>
>>>>>
>>>>>    I decided to do this using Hive.
>>>>>     And as I told previously how much time Hive consumed for
>>>>> execution. my
>>>>> organization is expecting to complete this task in near about less than
>>>>> 1/2 hours
>>>>>
>>>>> Now after spending too much time for complete execution for this task
>>>>> what
>>>>> should I do?
>>>>> I want to ask one thing that:
>>>>> *Is this Use Case is possible with Hive?* If possible what should I do
>>>>> in
>>>>>
>>>>> my program to increase the performance?
>>>>> *And If not possible what is the other good way to implement this Use
>>>>> Case?*
>>>>>
>>>>>
>>>>> Please reply me.
>>>>> Thanks
>>>>>
>>>>>
>>>>> --
>>>>> Regards,
>>>>> Bhavesh Shah
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Nitin Pawar
>>>>
>>>>
>>>
>>>
>>> --
>>> Regards,
>>> Bhavesh Shah
>>>
>>>
>>
>>
>> --
>> Nitin Pawar
>>
>>
>
>
>
>


-- 
Nitin Pawar

Re: Is my Use Case possible with Hive?

Posted by Bhavesh Shah <bh...@gmail.com>.
I don't know how many maps and reducers there were, because for some
reason my instance got terminated   :(
I want to know one thing: if we use multiple nodes, what should the count
of maps and reducers be?
I am confused about that. How do we decide it?
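
As a rule of thumb (an assumption based on standard MapReduce split
behaviour, not anything confirmed in this thread), the map count is
roughly the input size divided by the split size, while the reducer
count is set explicitly via `mapred.reduce.tasks`. A minimal sketch:

```python
import math

def estimate_map_tasks(input_bytes, split_bytes=64 * 1024 * 1024):
    """Rule of thumb: Hadoop launches about one map task per input split.

    64 MB was the common default HDFS block/split size at the time;
    treat both numbers as tunable assumptions, not fixed facts.
    """
    return max(1, math.ceil(input_bytes / split_bytes))

# 10 GB of input with 64 MB splits -> 160 map tasks
print(estimate_map_tasks(10 * 1024**3))
```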

I also want to try different properties like block size, output
compression, in-memory buffer size, parallel execution, etc.
Do all these properties matter for increasing performance?
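
A sketch of the kind of session settings meant here (these are standard
Hive/Hadoop property names of that era; defaults and effects vary by
version, so treat the values as starting points to experiment with):

```sql
SET mapred.max.split.size=67108864;       -- smaller splits => more map tasks
SET hive.exec.parallel=true;              -- run independent stages in parallel
SET hive.exec.compress.output=true;       -- compress final job output
SET hive.exec.compress.intermediate=true; -- compress data between stages
```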

Nitin, you have read my whole use case. Is what I did to implement it
with the help of Hadoop correct?
Is it possible to increase the performance?

Thanks Nitin for your reply.   :)

-- 
Regards,
Bhavesh Shah


On Mon, May 14, 2012 at 2:07 PM, Nitin Pawar <ni...@gmail.com>wrote:

> with a 10 node cluster the performance should improve.
> how many maps and reducers are being launched?
>
>
> On Mon, May 14, 2012 at 1:18 PM, Bhavesh Shah <bh...@gmail.com>wrote:
>
>> I have near about 1 billion records in my relational database.
>> Currently locally I am using just one cluster. But I also tried this on
>> Amazon Elastic Mapreduce with 10 nodes. But the time taken to execute the
>> complete program is same as that on my  single local machine.
>>
>>
>> On Mon, May 14, 2012 at 1:13 PM, Nitin Pawar <ni...@gmail.com>wrote:
>>
>>> how many # records?
>>>
>>> what is your hadoop cluster setup? how many nodes?
>>> if you are running hadoop on a single node setup with normal desktop, i
>>> doubt it will be of any help.
>>>
>>> You need a stronger cluster setup for better query runtimes and ofcourse
>>> query optimization which I guess you would have already taken care.
>>>
>>>
>>>
>>> On Mon, May 14, 2012 at 12:39 PM, Bhavesh Shah <bh...@gmail.com>wrote:
>>>
>>>> Hello all,
>>>> My Use Case is:
>>>> 1) I have a relational database which has a very large data. (MS SQL
>>>> Server)
>>>> 2) I want to do analysis on these huge data  and want to generate
>>>> reports
>>>> on it after analysis.
>>>> Like this I have to generate various reports based on different
>>>> analysis.
>>>>
>>>> I tried to implement this using Hive. What I did is:
>>>> 1) I imported all tables in Hive from MS SQL Server using SQOOP.
>>>> 2) I wrote many queries in Hive which is executing using JDBC on Hive
>>>> Thrift Server
>>>> 3) I am getting the correct result in table form, which I am expecting
>>>> 4) But the problem is that the time which require to execute is too much
>>>> long.
>>>>    (My complete program is executing in near about 3-4 hours on *small
>>>> amount of data*).
>>>>
>>>>
>>>>    I decided to do this using Hive.
>>>>     And as I told previously how much time Hive consumed for execution.
>>>> my
>>>> organization is expecting to complete this task in near about less than
>>>> 1/2 hours
>>>>
>>>> Now after spending too much time for complete execution for this task
>>>> what
>>>> should I do?
>>>> I want to ask one thing that:
>>>> *Is this Use Case is possible with Hive?* If possible what should I do
>>>> in
>>>>
>>>> my program to increase the performance?
>>>> *And If not possible what is the other good way to implement this Use
>>>> Case?*
>>>>
>>>>
>>>> Please reply me.
>>>> Thanks
>>>>
>>>>
>>>> --
>>>> Regards,
>>>> Bhavesh Shah
>>>>
>>>
>>>
>>>
>>> --
>>> Nitin Pawar
>>>
>>>
>>
>>
>> --
>> Regards,
>> Bhavesh Shah
>>
>>
>
>
> --
> Nitin Pawar
>
>

Re: Is my Use Case possible with Hive?

Posted by Nitin Pawar <ni...@gmail.com>.
With a 10-node cluster the performance should improve.
How many maps and reducers are being launched?


On Mon, May 14, 2012 at 1:18 PM, Bhavesh Shah <bh...@gmail.com>wrote:

> I have near about 1 billion records in my relational database.
> Currently locally I am using just one cluster. But I also tried this on
> Amazon Elastic Mapreduce with 10 nodes. But the time taken to execute the
> complete program is same as that on my  single local machine.
>
>
> On Mon, May 14, 2012 at 1:13 PM, Nitin Pawar <ni...@gmail.com>wrote:
>
>> how many # records?
>>
>> what is your hadoop cluster setup? how many nodes?
>> if you are running hadoop on a single node setup with normal desktop, i
>> doubt it will be of any help.
>>
>> You need a stronger cluster setup for better query runtimes and ofcourse
>> query optimization which I guess you would have already taken care.
>>
>>
>>
>> On Mon, May 14, 2012 at 12:39 PM, Bhavesh Shah <bh...@gmail.com>wrote:
>>
>>> Hello all,
>>> My Use Case is:
>>> 1) I have a relational database which has a very large data. (MS SQL
>>> Server)
>>> 2) I want to do analysis on these huge data  and want to generate reports
>>> on it after analysis.
>>> Like this I have to generate various reports based on different analysis.
>>>
>>> I tried to implement this using Hive. What I did is:
>>> 1) I imported all tables in Hive from MS SQL Server using SQOOP.
>>> 2) I wrote many queries in Hive which is executing using JDBC on Hive
>>> Thrift Server
>>> 3) I am getting the correct result in table form, which I am expecting
>>> 4) But the problem is that the time which require to execute is too much
>>> long.
>>>    (My complete program is executing in near about 3-4 hours on *small
>>> amount of data*).
>>>
>>>
>>>    I decided to do this using Hive.
>>>     And as I told previously how much time Hive consumed for execution.
>>> my
>>> organization is expecting to complete this task in near about less than
>>> 1/2 hours
>>>
>>> Now after spending too much time for complete execution for this task
>>> what
>>> should I do?
>>> I want to ask one thing that:
>>> *Is this Use Case is possible with Hive?* If possible what should I do in
>>>
>>> my program to increase the performance?
>>> *And If not possible what is the other good way to implement this Use
>>> Case?*
>>>
>>>
>>> Please reply me.
>>> Thanks
>>>
>>>
>>> --
>>> Regards,
>>> Bhavesh Shah
>>>
>>
>>
>>
>> --
>> Nitin Pawar
>>
>>
>
>
> --
> Regards,
> Bhavesh Shah
>
>


-- 
Nitin Pawar

Re: Is my Use Case possible with Hive?

Posted by Bhavesh Shah <bh...@gmail.com>.
I have about 1 billion records in my relational database.
Locally I am currently using just a single node. But I also tried this on
Amazon Elastic MapReduce with 10 nodes, and the time taken to execute the
complete program was the same as on my single local machine.

On Mon, May 14, 2012 at 1:13 PM, Nitin Pawar <ni...@gmail.com>wrote:

> how many # records?
>
> what is your hadoop cluster setup? how many nodes?
> if you are running hadoop on a single node setup with normal desktop, i
> doubt it will be of any help.
>
> You need a stronger cluster setup for better query runtimes and ofcourse
> query optimization which I guess you would have already taken care.
>
>
>
> On Mon, May 14, 2012 at 12:39 PM, Bhavesh Shah <bh...@gmail.com>wrote:
>
>> Hello all,
>> My Use Case is:
>> 1) I have a relational database which has a very large data. (MS SQL
>> Server)
>> 2) I want to do analysis on these huge data  and want to generate reports
>> on it after analysis.
>> Like this I have to generate various reports based on different analysis.
>>
>> I tried to implement this using Hive. What I did is:
>> 1) I imported all tables in Hive from MS SQL Server using SQOOP.
>> 2) I wrote many queries in Hive which is executing using JDBC on Hive
>> Thrift Server
>> 3) I am getting the correct result in table form, which I am expecting
>> 4) But the problem is that the time which require to execute is too much
>> long.
>>    (My complete program is executing in near about 3-4 hours on *small
>> amount of data*).
>>
>>
>>    I decided to do this using Hive.
>>     And as I told previously how much time Hive consumed for execution. my
>> organization is expecting to complete this task in near about less than
>> 1/2 hours
>>
>> Now after spending too much time for complete execution for this task what
>> should I do?
>> I want to ask one thing that:
>> *Is this Use Case is possible with Hive?* If possible what should I do in
>>
>> my program to increase the performance?
>> *And If not possible what is the other good way to implement this Use
>> Case?*
>>
>>
>> Please reply me.
>> Thanks
>>
>>
>> --
>> Regards,
>> Bhavesh Shah
>>
>
>
>
> --
> Nitin Pawar
>
>


-- 
Regards,
Bhavesh Shah

Re: Is my Use Case possible with Hive?

Posted by Nitin Pawar <ni...@gmail.com>.
How many records?

What is your Hadoop cluster setup? How many nodes?
If you are running Hadoop in a single-node setup on a normal desktop, I
doubt it will be of much help.

You need a stronger cluster setup for better query runtimes, and of
course query optimization, which I guess you have already taken care of.



On Mon, May 14, 2012 at 12:39 PM, Bhavesh Shah <bh...@gmail.com>wrote:

> Hello all,
> My Use Case is:
> 1) I have a relational database which has a very large data. (MS SQL
> Server)
> 2) I want to do analysis on these huge data  and want to generate reports
> on it after analysis.
> Like this I have to generate various reports based on different analysis.
>
> I tried to implement this using Hive. What I did is:
> 1) I imported all tables in Hive from MS SQL Server using SQOOP.
> 2) I wrote many queries in Hive which is executing using JDBC on Hive
> Thrift Server
> 3) I am getting the correct result in table form, which I am expecting
> 4) But the problem is that the time which require to execute is too much
> long.
>    (My complete program is executing in near about 3-4 hours on *small
> amount of data*).
>
>    I decided to do this using Hive.
>     And as I told previously how much time Hive consumed for execution. my
> organization is expecting to complete this task in near about less than
> 1/2 hours
>
> Now after spending too much time for complete execution for this task what
> should I do?
> I want to ask one thing that:
> *Is this Use Case is possible with Hive?* If possible what should I do in
> my program to increase the performance?
> *And If not possible what is the other good way to implement this Use
> Case?*
>
> Please reply me.
> Thanks
>
>
> --
> Regards,
> Bhavesh Shah
>



-- 
Nitin Pawar


Re: Is my Use Case possible with Hive?

Posted by Jagat <ja...@gmail.com>.
Hello,

I would attack the problem as follows:

Do all possible aggregation calculations inside Pig, and then store the
final results in a table to query with Hive.

Use partitions in Hive, guided by the WHERE clauses in your queries.

Use indexes to improve performance further.
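
A sketch of the partition and index suggestions (table and column names
are hypothetical placeholders; Hive indexes exist since 0.7):

```sql
-- Partition by the column your WHERE clauses filter on most often.
CREATE TABLE report_data (record_id BIGINT, metric DOUBLE)
PARTITIONED BY (visit_date STRING);

-- A compact index on a frequently filtered column.
CREATE INDEX idx_record ON TABLE report_data (record_id)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD;
```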

-----------
Sent from Mobile, short and crisp.
On 14-May-2012 12:39 PM, "Bhavesh Shah" <bh...@gmail.com> wrote:

> Hello all,
> My Use Case is:
> 1) I have a relational database which has a very large data. (MS SQL
> Server)
> 2) I want to do analysis on these huge data  and want to generate reports
> on it after analysis.
> Like this I have to generate various reports based on different analysis.
>
> I tried to implement this using Hive. What I did is:
> 1) I imported all tables in Hive from MS SQL Server using SQOOP.
> 2) I wrote many queries in Hive which is executing using JDBC on Hive
> Thrift Server
> 3) I am getting the correct result in table form, which I am expecting
> 4) But the problem is that the time which require to execute is too much
> long.
>    (My complete program is executing in near about 3-4 hours on *small
> amount of data*).
>
>    I decided to do this using Hive.
>     And as I told previously how much time Hive consumed for execution. my
> organization is expecting to complete this task in near about less than
> 1/2 hours
>
> Now after spending too much time for complete execution for this task what
> should I do?
> I want to ask one thing that:
> *Is this Use Case is possible with Hive?* If possible what should I do in
> my program to increase the performance?
> *And If not possible what is the other good way to implement this Use
> Case?*
>
> Please reply me.
> Thanks
>
>
> --
> Regards,
> Bhavesh Shah
>
