Posted to user@hive.apache.org by Ibrahim Yakti <iy...@souq.com> on 2012/12/24 14:08:39 UTC

Reflect MySQL updates into Hive

Hi All,

We are new to Hadoop and Hive. We are trying to use Hive to run
analytical queries, and we are using Sqoop to import data into Hive. In
our RDBMS the data is updated very frequently, and this needs to be
reflected in Hive. Hive does not support update/delete, but there are
many workarounds to do this task.

What we have in mind is importing all the tables into Hive as is, then
building the required tables for reporting.

My questions are:

   1. What is the best way to reflect MySQL updates into Hive with minimal
   resources?
   2. Is Sqoop the right tool to do the ETL?
   3. Is Hive the right tool to run this kind of query, or should we
   search for alternatives?

Any hint will be useful. Thanks in advance.

--
Ibrahim

Re: Reflect MySQL updates into Hive

Posted by Ibrahim Yakti <iy...@souq.com>.
My problem is in eliminating the duplicates and keeping only the correct
data. Any advice, please?
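
For reference, a minimal sketch of the "original + delta + latest
timestamp" rewrite that Edward describes later in this thread. The table
and column names (orders, orders_delta, orders_merged with id, status,
updated_at) are invented, and it assumes updated_at strictly increases on
every change to a row; otherwise the join could emit duplicates:

    hive -e "
    -- Keep, for each id, only the row with the newest updated_at,
    -- taking rows from both the current copy and the imported delta.
    INSERT OVERWRITE TABLE orders_merged
    SELECT u.id, u.status, u.updated_at
    FROM (
      SELECT id, status, updated_at FROM orders
      UNION ALL
      SELECT id, status, updated_at FROM orders_delta
    ) u
    JOIN (
      SELECT id, MAX(updated_at) AS max_ts
      FROM (
        SELECT id, updated_at FROM orders
        UNION ALL
        SELECT id, updated_at FROM orders_delta
      ) w
      GROUP BY id
    ) m
    ON (u.id = m.id AND u.updated_at = m.max_ts);
    "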

Re: Reflect MySQL updates into Hive

Posted by Dean Wampler <de...@thinkbiganalytics.com>.
Looks good, but a few suggestions. If you can eliminate duplicates, etc. as
you ingest the data into HDFS, that would eliminate a cleansing step. Note
that if the target directory in HDFS IS the specified location for an
external Hive table/partition, then there will be no separate step to "load
in Hive as External Table". It's already there!

Your "transform data..." is a common pattern; stage "raw" data into a
location, then use Hive (or Pig) to transform it into the final form and
INSERT INTO the final Hive table.
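
A minimal sketch of that staging pattern, assuming comma-delimited Sqoop
output and invented table names:

    hive -e "
    -- External 'raw' table laid over the directory Sqoop writes to;
    -- no separate load step is needed.
    CREATE EXTERNAL TABLE IF NOT EXISTS orders_raw (
      id INT, status STRING, updated_at STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/data/raw/orders';

    -- Transform/cleanse, then append into the final reporting table.
    INSERT INTO TABLE orders_final
    SELECT id, status, updated_at FROM orders_raw;
    "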

dean

-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330

Re: Reflect MySQL updates into Hive

Posted by Ibrahim Yakti <iy...@souq.com>.
Thanks, Dean, for the great reply. Setting up the incremental import
should be easy. If I partition my data, how will Hive get me only the
updated rows, considering that a row may have multiple fields that are
updated over time? How will I manage the tables that are based on
multiple sources? And do you recommend importing the data into HDFS
instead of directly into Hive? Won't we have a lot of duplicated records
then?

Regarding automation, we were thinking of using the sqoop-job command or
cron jobs, as you suggested.

So, the suggested flow is as follows:

MySQL ---(Extract / Load)---> HDFS (Table/Year/Month/Day) ---> Load in Hive
as External Table ---(Transform Data & Join Tables)--> Save it in Hive
tables for reporting.
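
Concretely, one day's cycle might look like the sketch below (the
connection string, paths, and column names are all assumptions):

    #!/bin/bash
    # One day's cycle (2012-12-24 as an example date).
    Y=2012 M=12 D=24
    # 1) Pull rows changed since the start of the day into a dated dir.
    sqoop import \
      --connect jdbc:mysql://dbhost/shop --username etl --password '***' \
      --table orders \
      --incremental lastmodified --check-column updated_at \
      --last-value "$Y-$M-$D 00:00:00" \
      --target-dir /data/orders/$Y/$M/$D
    # 2) Tell Hive about the new partition (no data "load" needed).
    hive -e "ALTER TABLE orders ADD IF NOT EXISTS
             PARTITION (year=$Y, month=$M, day=$D)
             LOCATION '/data/orders/$Y/$M/$D';"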


Correct?

Appreciated.


--
Ibrahim



Re: Reflect MySQL updates into Hive

Posted by Ibrahim Yakti <iy...@souq.com>.
Thanks Mohammad, I will be waiting ... meanwhile, it seems I will look
into HBase and give it a try ... unless someone advises something
better/easier.


--
Ibrahim



Re: Reflect MySQL updates into Hive

Posted by Mohammad Tariq <do...@gmail.com>.
Hello Ibrahim,

           Sorry for the late response. Those replies were for Kshiva. I
saw his question (exactly the same as this one) multiple times on the Pig
mailing list as well, so I just thought of giving him some pointers on how
to use the list. I should have specified that properly. Apologies for
creating the confusion.

Coming back to the actual point: yes, the flow is fine. Normally people do
it like this. But I was looking for some alternate way, so that we don't
have to go through this long process for the updates. I'll let you know
once I find something useful. But till now I haven't found anything better
than whatever Dean sir has suggested. Please do let me know if you find
something before me.

Many thanks.


Best Regards,
Tariq
+91-9741563634
https://mtariq.jux.com/



Re: Reflect MySQL updates into Hive

Posted by Ibrahim Yakti <iy...@souq.com>.
After more reading, a suggested scenario looks like:

MySQL ---(Extract / Load)---> HDFS ---> Load into HBase --> Read as
external in Hive ---(Transform Data & Join Tables)--> Use hive for Joins &
Queries ---> Update HBase as needed & Reload in Hive.
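
For the "read as external in Hive" step, a minimal sketch using Hive's
HBase storage handler (the HBase table name, column family, and columns
are invented):

    hive -e "
    -- Lay a Hive table over an existing HBase table so updates made in
    -- HBase are visible to Hive queries immediately.
    CREATE EXTERNAL TABLE orders_hb (
      id INT, status STRING, updated_at STRING
    )
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES (
      'hbase.columns.mapping' = ':key,d:status,d:updated_at'
    )
    TBLPROPERTIES ('hbase.table.name' = 'orders');
    "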

What do you think please?



--
Ibrahim



Re: Reflect MySQL updates into Hive

Posted by Ibrahim Yakti <iy...@souq.com>.
Mohammad, I am not sure if the answers & the link were for me or for
Kshiva's question.

If I have partitioned my data based on status, for example, then when I
run the update import it will add the updated data to a new partition
("success" or "shipped", for example) while keeping the old data
("confirmed" or "paid", for example), right?


--
Ibrahim



Re: Reflect MySQL updates into Hive

Posted by Mohammad Tariq <do...@gmail.com>.
Also, have a look at this:
http://www.catb.org/~esr/faqs/smart-questions.html

Best Regards,
Tariq
+91-9741563634
https://mtariq.jux.com/



Re: Reflect MySQL updates into Hive

Posted by Mohammad Tariq <do...@gmail.com>.
Have a look at Beeswax.

BTW, do you have access to Google at your station? Same question on the
Pig mailing list as well, that too twice.

Best Regards,
Tariq
+91-9741563634
https://mtariq.jux.com/



Re: Reflect MySQL updates into Hive

Posted by Kshiva Kps <ks...@gmail.com>.
Hi,

Are there any Hive editors where we can write 100 to 150 Hive scripts? I
believe it is not easy to do all the scripts in CLI mode; something like
an IDE for Java or TOAD for SQL. Please advise, many thanks


Thanks


Re: Reflect MySQL updates into Hive

Posted by Dean Wampler <de...@thinkbiganalytics.com>.
This is not as hard as it sounds. The hardest part is setting up the
incremental query against your MySQL database. Then you can write the
results to new files in the HDFS directory for the table and Hive will see
them immediately. Yes, even though Hive doesn't support updates, it doesn't
care how many files are in the directory. The trick is to avoid lots of
little files.
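
For a non-partitioned table, that can be as simple as letting each
incremental run append new files under the table's directory (a sketch;
the connection details and the --last-value bookmark are invented):

    # Each incremental run adds new part-files next to the existing ones;
    # Hive reads whatever files are in the directory.
    sqoop import \
      --connect jdbc:mysql://dbhost/shop --username etl --password '***' \
      --table orders \
      --incremental append --check-column id --last-value 1000000 \
      --target-dir /user/hive/warehouse/orders \
      --append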

As others have suggested, you should consider partitioning the data,
perhaps by time. Say you import a few HDFS blocks' worth of data each
day; then use year/month/day partitioning to speed up your Hive queries.
You'll need to add the partitions to the table as you go, though you can
add them in batch, say once a month for all of that month's partitions.
Hive doesn't care if the partition directories don't exist yet or are
empty. I also recommend using an external table, which gives you more
flexibility on directory layout, etc.
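
For instance (a sketch; the names and the text layout are assumptions):

    hive -e "
    CREATE EXTERNAL TABLE IF NOT EXISTS orders (
      id INT, status STRING, updated_at STRING
    )
    PARTITIONED BY (year INT, month INT, day INT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/data/orders';

    -- Partitions can be added ahead of time, in batch; empty or
    -- missing directories are fine.
    ALTER TABLE orders ADD IF NOT EXISTS
      PARTITION (year=2012, month=12, day=24)
      LOCATION '/data/orders/2012/12/24';
    "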

Sqoop might be the easiest tool for importing the data, as it will even
generate a Hive table schema from the original MySQL table. However, that
feature may not be useful in this case, as you already have the table.
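
For completeness, that schema generation looks roughly like this sketch
(the connection string is invented):

    # Generate a Hive table definition from the MySQL table's metadata,
    # without importing any data.
    sqoop create-hive-table \
      --connect jdbc:mysql://dbhost/shop --username etl --password '***' \
      --table orders --hive-table orders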

I think Oozie is horribly complex to use and overkill for this purpose. A
simple bash script triggered periodically by cron is all you need. If you
aren't using a partitioned table, you have a single sqoop command to run.
If you have partitioned data, you'll also need a hive statement in the
script to create the partition, unless you do those in batch once a month,
etc., etc.
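
For instance (paths and schedule invented), with the sqoop command and
any ALTER TABLE statement wrapped in a small script:

    # /etc/cron.d/orders-import -- run the nightly import at 02:00.
    0 2 * * * etl /usr/local/bin/import_orders.sh >> /var/log/import_orders.log 2>&1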

Hope this helps,
dean




-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330

Re: Reflect MySQL updates into Hive

Posted by Ibrahim Yakti <iy...@souq.com>.
What if you have many columns that need to be updated? A simple example:
confirmation date, payment status(es) plus status update time, delivery,
etc. On what basis would you set your partitions, and how would the old
data be removed, given that the updated data will be reloaded into a
different partition if I partition by payment status, for example?


--
Ibrahim



Re: Reflect MySQL updates into Hive

Posted by Mohammad Tariq <do...@gmail.com>.
I was actually trying to answer your actual questions. What are you
currently doing to tackle this update problem, and what kind of tweak are
you looking for? There is no direct solution to achieve this
out-of-the-box, as you have said.

Best Regards,
Tariq
+91-9741563634
https://mtariq.jux.com/


On Mon, Dec 24, 2012 at 7:38 PM, Ibrahim Yakti <iy...@souq.com> wrote:

> This is already done, but Hive does not support updates or deletion of
> data, so when I import the records after a specific "last_update_time",
> Hive will append them, not replace them.
>
>
> --
> Ibrahim
>
>
> On Mon, Dec 24, 2012 at 5:03 PM, Mohammad Tariq <do...@gmail.com>wrote:
>
>> You can use Apache Oozie to schedule your imports.
>>
>> Alternatively, you can have an additional column in your SQL table, say
>> LastUpdatedTime or something. As soon as there is a change in this column
>> you can start the import from this point. This way you don't have to import
>> all the things every time there is a change in your table. You just have to
>> move only the most recent data, say only the 'delta' amount of data.
>>
>> Best Regards,
>> Tariq
>> +91-9741563634
>> https://mtariq.jux.com/
>>
>>
>> On Mon, Dec 24, 2012 at 7:08 PM, Ibrahim Yakti <iy...@souq.com> wrote:
>>
>>> My question was how to reflect MySQL updates in Hadoop/Hive; this is our
>>> problem now.
>>>
>>>
>>> --
>>> Ibrahim
>>>
>>>
>>> On Mon, Dec 24, 2012 at 4:35 PM, Mohammad Tariq <do...@gmail.com>wrote:
>>>
>>>> Cool. Then go ahead :)
>>>>
>>>> Just in case you need something in real time, you can have a look at
>>>> Impala. (I know nobody likes to get preached at, but just in case ;) ).
>>>>
>>>> Best Regards,
>>>> Tariq
>>>> +91-9741563634
>>>> https://mtariq.jux.com/
>>>>
>>>>
>>>> On Mon, Dec 24, 2012 at 7:00 PM, Ibrahim Yakti <iy...@souq.com> wrote:
>>>>
>>>>> Thanks Mohammad. No, we do not have any plans to replace our RDBMS
>>>>> with Hive. Hadoop/Hive will be used for data warehousing & batch
>>>>> processing; as I said, we want to use Hive for analytical queries.
>>>>>
>>>>>
>>>>> --
>>>>> Ibrahim
>>>>>
>>>>>
>>>>> On Mon, Dec 24, 2012 at 4:19 PM, Mohammad Tariq <do...@gmail.com>wrote:
>>>>>
>>>>>> Hello Ibrahim,
>>>>>>
>>>>>>      A quick question. Are you planning to replace your SQL DB with
>>>>>> Hive? If that is the case, I would not suggest doing that. Both are meant
>>>>>> for entirely different purposes. Hive is for batch processing and not for
>>>>>> real-time systems. So if your requirements involve real-time things, you
>>>>>> need to think before moving ahead.
>>>>>>
>>>>>> Yes, Sqoop is 'the' tool. It is primarily meant for this purpose.
>>>>>>
>>>>>> HTH
>>>>>>
>>>>>> Best Regards,
>>>>>> Tariq
>>>>>> +91-9741563634
>>>>>> https://mtariq.jux.com/
>>>>>>
>>>>>>

Re: Reflect MySQL updates into Hive

Posted by Mohammad Tariq <do...@gmail.com>.
Good points by Edward. I especially love point no. 2.

Best Regards,
Tariq
+91-9741563634
https://mtariq.jux.com/


On Mon, Dec 24, 2012 at 7:58 PM, Edward Capriolo <ed...@gmail.com>wrote:

> You can only do the last_update idea if this is an insert-only dataset.
>
> If your table takes updates you need a different strategy.
> 1) full dumps every interval.
> 2) Using a storage handler like hbase or cassandra that takes update
> operations
>
>
>
> On Mon, Dec 24, 2012 at 9:22 AM, Jeremiah Peschka <
> jeremiah.peschka@gmail.com> wrote:
>
>> If it were me, I would find a way to identify the partitions that have
>> modified data and then re-load a subset of the partitions (only the ones
>> with changes) on a regular basis. Instead of updating/deleting data, you'll
>> be re-loading specific partitions as an all or nothing action.
>>

Re: Reflect MySQL updates into Hive

Posted by Ibrahim Yakti <iy...@souq.com>.
Bottom line: use Sqoop to import the data into HBase/Cassandra for storage,
and use Hive to query it through external tables. Did I miss anything?
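
For the external-table half of that, here is a minimal sketch of a Hive
table laid over HBase via the HBase storage handler; the table and column
names are hypothetical, and it assumes the handler jars are on Hive's
classpath:

    -- Hive external table backed by an existing HBase table;
    -- rows updated in HBase become visible to Hive queries on the next read
    CREATE EXTERNAL TABLE orders_hbase(order_id INT, status STRING)
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:status")
    TBLPROPERTIES ("hbase.table.name" = "orders");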


--
Ibrahim



Re: Reflect MySQL updates into Hive

Posted by Edward Capriolo <ed...@gmail.com>.
Hive cannot easily handle updates. The most creative way I have seen this
done was to capture all the updates and then use union queries to rewrite
the same Hive table with the newest values:

original + union delta + column with latest timestamp = new original

But that is a lot of processing, especially when you may not have many
updates. Hive has storage handlers that let you lay a table over HBase and
Cassandra data. Store your data in those systems, which do take updates,
then use Hive to query them.
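
As a rough sketch of that union rewrite, assuming hypothetical orders and
orders_delta tables with identical schemas, an id key, and a
last_update_time column (ties on the timestamp would need an extra
tiebreaker):

    -- keep only the newest version of each row across base + delta
    INSERT OVERWRITE TABLE orders_new
    SELECT t.*
    FROM (
      SELECT * FROM orders
      UNION ALL
      SELECT * FROM orders_delta
    ) t
    JOIN (
      SELECT id, MAX(last_update_time) AS max_ts
      FROM (
        SELECT * FROM orders
        UNION ALL
        SELECT * FROM orders_delta
      ) u
      GROUP BY id
    ) m
    ON t.id = m.id AND t.last_update_time = m.max_ts;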


Re: Reflect MySQL updates into Hive

Posted by Ibrahim Yakti <iy...@souq.com>.
Edward, can you explain more please? Are you suggesting that I should use
HBase for such tasks instead of Hive?


--
Ibrahim



Re: Reflect MySQL updates into Hive

Posted by Edward Capriolo <ed...@gmail.com>.
You can only do the last_update idea if this is an insert-only dataset.

If your table takes updates, you need a different strategy:
1) full dumps every interval (see the sketch below), or
2) a storage handler like HBase or Cassandra that takes update operations.
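
For option 1, a full dump into Hive might look like the hypothetical Sqoop
invocation below (the connection string, credentials, and table name are
placeholders):

    sqoop import \
      --connect jdbc:mysql://db.example.com/shop \
      --username etl -P \
      --table orders \
      --hive-import --hive-overwrite \
      --hive-table orders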



Re: Reflect MySQL updates into Hive

Posted by Jeremiah Peschka <je...@gmail.com>.
If it were me, I would find a way to identify the partitions that have
modified data and then re-load just that subset (only the ones with
changes) on a regular basis. Instead of updating/deleting data, you'll be
re-loading specific partitions as an all-or-nothing action.
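
In Hive terms, re-loading one partition is an INSERT OVERWRITE; a sketch
assuming a hypothetical orders table partitioned by dt and a staging table
holding the freshly re-extracted rows:

    -- replace the changed day wholesale; untouched partitions stay as-is
    INSERT OVERWRITE TABLE orders PARTITION (dt='2012-12-24')
    SELECT order_id, status, last_update_time
    FROM orders_staging
    WHERE dt='2012-12-24';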

-- 
---
Jeremiah Peschka
Founder, Brent Ozar Unlimited
Microsoft SQL Server MVP

Re: Reflect MySQL updates into Hive

Posted by Ibrahim Yakti <iy...@souq.com>.
This is already done, but Hive supports neither updates nor deletes, so
when I import the records after a specific "last_update_time", Hive will
append them, not replace them.


--
Ibrahim



Re: Reflect MySQL updates into Hive

Posted by Mohammad Tariq <do...@gmail.com>.
You can use Apache Oozie to schedule your imports.

Alternatively, you can have an additional column in your SQL table, say
LastUpdatedTime or something. As soon as there is a change in this column
you can start the import from that point. This way you don't have to import
everything every time there is a change in your table; you just have to
move the most recent data, only the 'delta'.
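
With Sqoop this maps onto its incremental mode; a hypothetical sketch, with
the connection details and check column as placeholders:

    sqoop import \
      --connect jdbc:mysql://db.example.com/shop \
      --username etl -P \
      --table orders \
      --target-dir /staging/orders_delta \
      --incremental lastmodified \
      --check-column LastUpdatedTime \
      --last-value '2012-12-24 00:00:00'

A saved Sqoop job (sqoop job --create ...) remembers the last value between
runs, so each run picks up only the new delta.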

Best Regards,
Tariq
+91-9741563634
https://mtariq.jux.com/



Re: Reflect MySQL updates into Hive

Posted by Ibrahim Yakti <iy...@souq.com>.
My question was how to reflect MySQL updates in Hadoop/Hive; this is our
problem now.


--
Ibrahim



Re: Reflect MySQL updates into Hive

Posted by Mohammad Tariq <do...@gmail.com>.
Cool. Then go ahead :)

Just in case you need something in real time, you can have a look at
Impala. (I know nobody likes to get preached to, but just in case ;) )

Best Regards,
Tariq
+91-9741563634
https://mtariq.jux.com/



Re: Reflect MySQL updates into Hive

Posted by Ibrahim Yakti <iy...@souq.com>.
Thanks Mohammad. No, we do not have any plans to replace our RDBMS with
Hive. Hadoop/Hive will be used for data warehousing and batch processing;
as I said, we want to use Hive for analytical queries.


--
Ibrahim



Re: Reflect MySQL updates into Hive

Posted by Mohammad Tariq <do...@gmail.com>.
Hello Ibrahim,

     A quick question. Are you planning to replace your SQL DB with Hive? If
that is the case, I would not suggest doing that. Both are meant for
entirely different purposes. Hive is for batch processing, not a real-time
system. So if your requirements involve real-time work, you need to think
before moving ahead.

Yes, Sqoop is 'the' tool. It is primarily meant for this purpose.

HTH

Best Regards,
Tariq
+91-9741563634
https://mtariq.jux.com/

