Posted to user@sqoop.apache.org by Zoltán Tóth-Czifra <zo...@softonic.com> on 2012/09/12 16:48:24 UTC

Throttling inserts to avoid replication lags

Hi guys,

We are using Sqoop (cdh3u3) to export Hive tables to relational databases. Usually these databases are only used by business intelligence to further analyze and filter the data. However, in certain cases we need to export to relational databases that are heavily accessed by our products and users.

Our concern is that Sqoop exports would interfere with this random access by our users. Temporal inconsistency of the data can be solved with a staging table and an atomic swap; however, we are concerned about the replication lag between the master and the slaves.
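
A minimal sketch of the staging-table swap mentioned above, assuming MySQL, where a single RENAME TABLE statement that lists several renames is applied atomically. The helper and the table names (products, products_staging, products_old) are hypothetical; the function only builds the statement and leaves execution to whatever client is in use.

```python
def build_swap_statement(live: str, staging: str, retired: str) -> str:
    """Return one MySQL RENAME TABLE statement that swaps staging into place.

    The live table is renamed to `retired` and the staging table takes over
    the live name, both inside the same statement.
    """
    def quote(name: str) -> str:
        # Backtick-quote an identifier, doubling any embedded backticks.
        return "`" + name.replace("`", "``") + "`"

    return (
        f"RENAME TABLE {quote(live)} TO {quote(retired)}, "
        f"{quote(staging)} TO {quote(live)}"
    )


if __name__ == "__main__":
    # Hypothetical table names, for illustration only.
    print(build_swap_statement("products", "products_staging", "products_old"))
```

Readers never see a half-loaded table this way, but the bulk load into the staging table still travels through replication, which is the problem described next.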

If we write a large amount of data quickly with Sqoop to the master (even to a staging table), it takes time (minutes) to replicate to the slaves, and that causes an inconsistency we can't allow: other writes from our users will be queued up behind it. I wonder if any of you have had similar problems. We are talking about a MySQL cluster, by the way.
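
One way to picture a writer that respects replication lag is the sketch below. It assumes the lag can be read as a number of seconds, for example from the Seconds_Behind_Master column of SHOW SLAVE STATUS on a replica. The function name, the get_lag_seconds callable and the thresholds are all hypothetical, and nothing here is part of Sqoop.

```python
import time
from typing import Callable, Iterable, Iterator, List


def lag_aware_batches(
    batches: Iterable[List[tuple]],
    get_lag_seconds: Callable[[], float],
    max_lag_seconds: float = 5.0,
    poll_interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[List[tuple]]:
    """Yield batches one at a time, waiting while replica lag is too high.

    get_lag_seconds is supplied by the caller (hypothetical); it should
    return the current lag of the slowest replica in seconds.
    """
    for batch in batches:
        # Hold the next batch back until the replicas have caught up.
        while get_lag_seconds() > max_lag_seconds:
            sleep(poll_interval)
        yield batch
```

The caller writes each yielded batch to the master and then asks for the next one, so the write rate follows what the replicas can absorb rather than a fixed number.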

As far as I know, Sqoop doesn't have any built-in throttle functionality (for example, a delay between inserts). We have been thinking of solving this with a proxy, but the existing solutions on the market are very incomplete.

Any other ideas? The more transparent, the better.

Thanks!

Re: Throttling inserts to avoid replication lags

Posted by Jarek Jarcec Cecho <ja...@apache.org>.

Hi Zoltan,
Thank you for your patch. Would you mind putting it on the Sqoop review board (https://reviews.apache.org)?

Jarcec

On Sun, Sep 16, 2012 at 04:19:53PM +0000, Zoltán Tóth-Czifra wrote:
> FYI, I found a simple way to implement this and created an issue to Sqoop with a patch.
> 
> Let's see if it gets accepted.
> 
> https://issues.apache.org/jira/browse/SQOOP-604
> ________________________________________
> From: Zoltán Tóth-Czifra [zoltan.tothczifra@softonic.com]
> Sent: Friday, September 14, 2012 12:35 PM
> To: Jarek Jarcec Cecho; user@sqoop.apache.org
> Subject: RE: Throttling inserts to avoid replication lags
> 
> Hi Jarcec,
> 
> Thank you very much for your answer! I really appreciate that you are thinking with me.
> Regarding the number of mappers for the export: yes, we can keep it low, but as you said, Sqoop will try its best for the highest throughput, so even one mapper can cause replication lag.
> 
> Your idea of the non-replicated tables could work, but I'm almost sure we'll need to discard it, because it's impossible to maintain with a few hundred machines, all constantly changing, adding new servers, creating new exports, etc...
> 
> The solutions we had in mind so far:
> 
> MySQL Proxy
> http://dev.mysql.com/downloads/mysql-proxy/
> It is an unofficial project for MySQL, and it seems to be stopped somehow. It doesn't seem to support throttling out of the box, but in theory one can use Lua scripts to write a system that limits the number of queries. This, however, does not guarantee a limit on data throughput (imagine one huge insert with thousands of lines...), and it doesn't seem to be ready for production.
> 
> Message Queues
> We had in mind a solution where we completely discard Sqoop and write our own tool, which somehow puts the lines exported from Hive onto a message queue, where we can process them the way we want. I see this as a very complex and costly solution.
> 
> Contributing to Sqoop
> This is what I see now as the best option - creating our own branch of Sqoop and adding the throttling feature.
> 
> If anyone has something else in mind, it's really appreciated.
> 
> Thanks!
> ________________________________________
> From: Jarek Jarcec Cecho [jarcec@apache.org]
> Sent: Thursday, September 13, 2012 12:19 PM
> To: user@sqoop.apache.org
> Subject: Re: Throttling inserts to avoid replication lags
> 
> Hi Zoltan,
> Sqoop is trying for the best throughput when moving data from source to destination, so your issue might be tricky to solve. I was thinking about it and I do have a couple of ideas:
> 
> 1) Did you try to limit the number of concurrent connections using the "-m" parameter?
> 
> 2) I can imagine that huge parallelism in Sqoop can give MySQL's single-threaded replication a hard time. Thinking outside the box, what about creating a table that won't be replicated (MySQL can limit replication at both the database and the table level) on all your nodes and performing your load on all of them (it doesn't matter whether sequentially or in parallel)? Once every node has the data, you can atomically switch the table on all nodes at once. I'm not sure whether it's feasible or whether it will actually work. I'm just trying to help.
> 
> Jarcec
> 
> On Thu, Sep 13, 2012 at 08:41:13AM +0000, Zoltán Tóth-Czifra wrote:
> > Hi,
> >
> > Thank you for your answers!
> >
> > I have been reading about Sqoop2, but since it's still under development it doesn't really serve me. Besides, my problem is not limiting connections, but somehow limiting the throughput of even one connection.
> >
> > This problem might not be Sqoop-specific, but I wondered if anyone has faced this and solved it somehow.
> >
> > Thank you!
> > ________________________________________
> > From: Kathleen Ting [kathleen@apache.org]
> > Sent: Thursday, September 13, 2012 1:27 AM
> > To: user@sqoop.apache.org
> > Subject: Re: Throttling inserts to avoid replication lags
> >
> > Chuck, Zoltán,
> >
> > In Sqoop 2, it has been discussed that connections will allow the
> > specification of a resource policy in that resources will be managed
> > by limiting the total number of physical Connections open at one time
> > and with an option to disable Connections.
> >
> > More info: https://blogs.apache.org/sqoop/entry/apache_sqoop_highlights_of_sqoop
> >
> > Regards, Kathleen
> >
> > On Wed, Sep 12, 2012 at 8:08 AM, Connell, Chuck
> > <Ch...@nuance.com> wrote:
> > > In my opinion, this is not a Sqoop problem. It is related to the RDBMS and
> > > the way it handles high-volume updates. Those updates might be coming from
> > > Sqoop, or they might be coming from a realtime stock market price feed.
> > >
> > >
> > >
> > > I would go ahead and test the system as is. Let Sqoop do all its updates. If
> > > you actually have a problem with inconsistencies or poor performance, then I
> > > would deal with it as a purely MySQL issue.
> > >
> > >
> > >
> > > (A low-tech approach… run the sqoop jobs at night??)
> > >
> > >
> > >
> > > Chuck
> > >
> > >
> > >
> > >
> > >
> > > From: Zoltán Tóth-Czifra [mailto:zoltan.tothczifra@softonic.com]
> > > Sent: Wednesday, September 12, 2012 10:48 AM
> > > To: user@sqoop.apache.org
> > > Subject: Throttling inserts to avoid replication lags
> > >
> > >
> > >
> > > Hi guys,
> > >
> > >
> > >
> > > We are using Sqoop (cdh3u3) to export Hive tables to relational databases.
> > > Usually these databases are only used by business intelligence to further
> > > analyze and filter the data. However, in certain cases we need to export to
> > > relational databases that are heavily accessed by our products and users.
> > >
> > >
> > >
> > > Our concern is that Sqoop exports would interfere with this random access of
> > > our users. Temporal inconsistency of the data can be solved with a staging
> > > table and an atomic swap, however, we are concerned about the replication
> > > lag between the master and the slaves.
> > >
> > >
> > >
> > > If we write large data quickly with Sqoop to the master (even to a staging
> > > table), that takes time to be replicated to the slaves (minutes) and causes
> > > an inconsistency we can't allow, that is, other writes from our users will
> > > be queued up. I wonder if any of you had similar problems. We are talking
> > > about a MySQL cluster by the way.
> > >
> > >
> > >
> > > As far as I know, Sqoop doesn't have any built-in throttle functionality (for
> > > example, a delay between inserts). We have been thinking of solving this with a
> > > proxy, but the existing solutions on the market are very incomplete.
> > >
> > >
> > >
> > > Any other ideas? The more transparent, the better.
> > >
> > >
> > >
> > > Thanks!

RE: Throttling inserts to avoid replication lags

Posted by Zoltán Tóth-Czifra <zo...@softonic.com>.

FYI, I found a simple way to implement this and created a Sqoop issue with a patch.

Let's see if it gets accepted.

https://issues.apache.org/jira/browse/SQOOP-604

RE: Throttling inserts to avoid replication lags

Posted by Zoltán Tóth-Czifra <zo...@softonic.com>.

Hi Jarcec,

Thank you very much for your answer! I really appreciate that you are thinking with me.
Regarding the number of mappers for the export: yes, we can keep it low, but as you said, Sqoop will try its best for the highest throughput, so even one mapper can cause replication lag.

Your idea of the non-replicated tables could work, but I'm almost sure we'll need to discard it, because it's impossible to maintain with a few hundred machines, all constantly changing, adding new servers, creating new exports, etc...

The solutions we had in mind so far:

MySQL Proxy
http://dev.mysql.com/downloads/mysql-proxy/
It is an unofficial project for MySQL, and it seems to be stopped somehow. It doesn't seem to support throttling out of the box, but in theory one can use Lua scripts to write a system that limits the number of queries. This, however, does not guarantee a limit on data throughput (imagine one huge insert with thousands of lines...), and it doesn't seem to be ready for production.
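
To illustrate why counting queries is not enough, here is a sketch of a limiter that meters bytes instead, so one huge insert is charged for its full size. It is a plain token bucket written for illustration; the class and its parameters are hypothetical and unrelated to MySQL Proxy or its Lua API.

```python
import time
from typing import Callable


class ByteRateLimiter:
    """Token bucket that meters bytes per second rather than statements."""

    def __init__(
        self,
        bytes_per_second: float,
        burst_bytes: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.rate = bytes_per_second
        self.capacity = burst_bytes
        self.tokens = burst_bytes
        self.clock = clock
        self.sleep = sleep
        self.last = clock()

    def acquire(self, nbytes: int) -> float:
        """Charge nbytes against the bucket; return the seconds slept."""
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        # Charge the full statement size, even if that puts the bucket in debt.
        self.tokens -= nbytes
        if self.tokens >= 0:
            return 0.0
        # Sleep long enough for the refill to pay the debt back.
        wait = -self.tokens / self.rate
        self.sleep(wait)
        return wait
```

Calling acquire(len(statement_bytes)) before sending each statement keeps the average throughput at bytes_per_second, whether the traffic is many small inserts or a few very large ones.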

Message Queues
We had in mind a solution where we completely discard Sqoop and write our own tool, which somehow puts the lines exported from Hive onto a message queue, where we can process them the way we want. I see this as a very complex and costly solution.

Contributing to Sqoop
This is what I see now as the best option - creating our own branch of Sqoop and adding the throttling feature. 
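
As a rough idea of what such a throttle could look like, the sketch below groups rows into batches and pauses after each one. It is only an illustration of the "delay between inserts" idea under assumed names (throttled_batches, batch_size, pause_seconds); it is not Sqoop code and not necessarily what any patch does.

```python
import time
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def throttled_batches(
    rows: Iterable[T],
    batch_size: int,
    pause_seconds: float,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[List[T]]:
    """Group rows into batches and pause after each full batch is consumed."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[T] = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            # The consumer is done with the batch (e.g. has committed it),
            # so back off before producing more rows.
            sleep(pause_seconds)
            batch = []
    if batch:
        # Final partial batch; no pause is needed after it.
        yield batch
```

With batch_size rows per insert and pause_seconds between inserts, the upper bound on throughput is roughly batch_size / pause_seconds rows per second, which makes the setting easy to reason about.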

If anyone has something else in mind, it's really appreciated.

Thanks!

Re: Throttling inserts to avoid replication lags

Posted by Jarek Jarcec Cecho <ja...@apache.org>.

Hi Zoltan,
Sqoop is trying for the best throughput when moving data from source to destination, so your issue might be tricky to solve. I was thinking about it and I do have a couple of ideas:

1) Did you try to limit the number of concurrent connections using the "-m" parameter?

2) I can imagine that huge parallelism in Sqoop can give MySQL's single-threaded replication a hard time. Thinking outside the box, what about creating a table that won't be replicated (MySQL can limit replication at both the database and the table level) on all your nodes and performing your load on all of them (it doesn't matter whether sequentially or in parallel)? Once every node has the data, you can atomically switch the table on all nodes at once. I'm not sure whether it's feasible or whether it will actually work. I'm just trying to help.
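
The first suggestion above can be sketched as a small helper that assembles a Sqoop export command line with a single mapper. The JDBC URL, table name and export directory in the example are placeholders, and the helper itself is hypothetical; only the flags (--connect, --table, --export-dir, -m) are Sqoop export options.

```python
from typing import List


def build_export_command(
    jdbc_url: str,
    table: str,
    export_dir: str,
    num_mappers: int = 1,
) -> List[str]:
    """Build the argument list for a Sqoop export with limited parallelism."""
    if num_mappers < 1:
        raise ValueError("num_mappers must be at least 1")
    return [
        "sqoop", "export",
        "--connect", jdbc_url,
        "--table", table,
        "--export-dir", export_dir,
        # -m caps the number of parallel map tasks, and so the number of
        # concurrent connections writing to the database.
        "-m", str(num_mappers),
    ]


if __name__ == "__main__":
    # Placeholder values, for illustration only.
    print(" ".join(build_export_command(
        "jdbc:mysql://db.example.com/shop",
        "products_staging",
        "/user/hive/warehouse/products",
    )))
```

This limits concurrency only; a single mapper still writes as fast as the database accepts rows.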

Jarcec

RE: Throttling inserts to avoid replication lags

Posted by Zoltán Tóth-Czifra <zo...@softonic.com>.

Hi,

Thank you for your answers!

I have been reading about Sqoop2, but since it's still under development it doesn't really serve me. Besides, my problem is not limiting connections, but somehow limiting the throughput of even one connection.

This problem might not be Sqoop-specific, but I wondered if anyone has faced this and solved it somehow.

Thank you!

Re: Throttling inserts to avoid replication lags

Posted by Kathleen Ting <ka...@apache.org>.

Chuck, Zoltán,

In Sqoop 2, it has been discussed that connections will allow the
specification of a resource policy in that resources will be managed
by limiting the total number of physical Connections open at one time
and with an option to disable Connections.

More info: https://blogs.apache.org/sqoop/entry/apache_sqoop_highlights_of_sqoop

Regards, Kathleen

RE: Throttling inserts to avoid replication lags

Posted by "Connell, Chuck" <Ch...@nuance.com>.

In my opinion, this is not a Sqoop problem. It is related to the RDBMS and the way it handles high-volume updates. Those updates might be coming from Sqoop, or they might be coming from a realtime stock market price feed.

I would go ahead and test the system as is. Let Sqoop do all its updates. If you actually have a problem with inconsistencies or poor performance, then I would deal with it as a purely MySQL issue.

(A low-tech approach... run the sqoop jobs at night??)

Chuck

